Porting Unix To The 386 




Table of Contents 


Porting Unix To The 386: A Practical Approach 


Designing the software specification 


William Frederick Jolitz and Lynne Greer Jolitz 


Getting Started: References, Equipment, and Software 


Development of the 386BSD Specification 


The Definition of the 386BSD Specification 


Conflicts in Priorities 


386BSD Port Goals: A Practical Approach 


Microprocessor and System Specification Issues 


386 Memory Management Vitals 


Segmentation and 386BSD 


Kernel Linear Address Space Overhead 


Virtual Address Space Layout 


Per-Process Data Structures 


386 Virtual Memory Address Translation Mechanism 


User to Kernel Communication Primitives 


Berkeley UNIX Virtual Memory System Strategy


Structure of Per-Process Data (u.) 


Process Context Description 


Page Fault and Segmentation Fault Mechanism


Other Processor Faults


Figure 7: 386 processor exceptions that needed to be mapped into the kernel exception-handling mechanism


Microprocessor Idiosyncrasies 


System Call Interface 


System Specific (ISA) Issues 


ISA Device Controllers 


ISA Device Auto Configuration 


Interrupt Priority Level Management 


Summary: Where is 386BSD Now? 


386 Segmentation and Paging


[FIGURE 6] 


[FIGURE 8] 


Figure 10: ISA device controllers: (a) data structures for configuring devices (b) sample 


[FIGURE 10b] 


Figure 3(b) 























Figure 3(a) 


Porting Unix To The 386: Three Initial PC Utilities


William Frederick Jolitz and Lynne Greer Jolitz 


The Purpose of Our PC Utilities 


The First PC Utility: boot.exe 


Example 1: Mock code that loads a GCC executable into memory 


The GCC Executable Format 




GATE A20 


The Second PC Utility: cpfs.exe 


The Third PC Utility: cpsw.exe 


Where We Go From Here 


The 386BSD Project and Berkeley UNIX 


[LISTING ONE] 


[LISTING TWO] 


[LISTING THREE] 


[LISTING FOUR] 


[LISTING FIVE] 


[LISTING SIX] 


[LISTING SEVEN] 


Porting Unix To The 386: The Standalone System 


Creating a protected-mode standalone C programming environment 


William Frederick Jolitz and Lynne Greer Jolitz 


The First Step


Introducing the Standalone System 


Keyboard Driver 


Prevaricating with the Standalone System 


Extending the Standalone System 


Processor Support — i386.c 


Initial Task State Load 


System Call Handling


Where Do We Go From Here? 


[LISTING ONE] 


[LISTING TWO] 


[LISTING THREE] 


[LISTING FOUR] 


[LISTING FIVE] 


[LISTING SIX] 





















































































[LISTING SEVEN]

[LISTING EIGHT]

[LISTING NINE]

[LISTING TEN]

[LISTING ELEVEN]

[LISTING TWELVE]

[LISTING THIRTEEN]

[LISTING FOURTEEN]

[LISTING FIFTEEN]

Porting Unix To The 386: Language Tools Cross Support

Developing the initial utilities

William Frederick Jolitz and Lynne Greer Jolitz

Why Develop Cross-Tools?

386BSD Cross-Tools Goals

What’s in the Tool Chest?

Methodology

Which C Standard?

Other Cross-Support Issues

Validating GCC for Use in a Cross-Environment

GCC Support Calls to Replace GNULIB

Choosing a Sensible Cross-Host

Where Do We Go From Here

Brief Notes: Copyrights, Copylefts, and Competitive Advantage

Figure 1: The copyright used by TeleMuse in the 386BSD article series

[LISTING ONE]

[LISTING TWO]

Porting Unix To The 386: The Initial Root Filesystem




William Frederick Jolitz and Lynne Greer Jolitz

The Role of the Root Filesystem

A Brief Review of the Root

Example 1: The root directory generated by the UNIX ls command (ls -l), shows file attributes, link count, ownership, file size, modification date, and file name.

Installation: /stand

Booting: /boot and /vmunix

Initialization: /sbin/init, /dev/console, and /bin/sh

Utilities: /bin and /sbin

nd /var

Other Directories: /lib, /mnt, /usr, /root, and /sys




Filesystem Downloading 


What’s in a Filesystem?

Why Do We Need a Root Filesystem?

The Filesystem Metaphor and its Importance in Future Work

Porting Unix To The 386: Research & The Commercial Sector

Where does BSD fit in?





































































































William Frederick Jolitz and Lynne Greer Jolitz

The 386BSD Project and Berkeley UNIX

Porting Unix To The 386: A Stripped-down Kernel

Onto the initial utilities

William Frederick Jolitz and Lynne Greer Jolitz

The Basic Structure of the UNIX Kernel

Incremental Strategy

Composing the Basic Minimal UNIX Kernel

How Can You Be in Two Places at Once...?

UNIX as a Subroutine Call

Configuring the 386 for UNIX Operation

Segments Revisited

Interrupts and Exceptions

Summary: What Do We Have Now?

Brief Notes: 386BSD Recursive Paging

Figure 1(a)

Figure 1(b)

[LISTING ONE]

[LISTING TWO]

[LISTING THREE]

[LISTING FOUR]

[LISTING FIVE]

Figure 2

Figure 4(a)

Figure 3

Figure 4(b)

Porting Unix To The 386: The Basic Kernel

Overview and initialization

William Frederick Jolitz and Lynne Greer Jolitz

Layered Modeling: Achieving a Well-Stacked System

Top-Level Layers

The Global Kernel Set

Process Private Set

The Proc Slot

Kernel Events

Machine-Independent Initialization

Virtual Memory Subsystem

Kernel Memory Allocator

Device Startup

Mounting the Root

Final Machine Initialization

Summary

386BSD Availability

[LISTING ONE]

Figure 2

Figure 2

Figure 1





































































Porting Unix To The 386: The Basic Kernel: Multiprogramming and Multitasking, Part One 


Multiprogramming and Multitasking, Part One 


William Frederick Jolitz and Lynne Greer Jolitz 


What is Multiprogramming? 


Attempts at Multiprogramming: MS-DOS and Finder 


Conventions and Definitions 


Processes and Tasks 


Context Switching


Preemption and Multitasking


Time Slicing


UNIX Organization for Multiprogramming


A UNIX Process’s Double Life 


Blocking and Unblocking Processes 


Process Context-Switching Mechanism 


Next Month 


References 


Brief Notes: Lightweight Processes and Threads 


Brief Notes: Is UNIX Real-Time Enough? 


Porting Unix To The 386: The Basic Kernel: Multiprogramming and Multitasking, Part II 


Multiprogramming and Multitasking, Part II 


William Frederick Jolitz and Lynne Greer Jolitz 


The 386BSD swtch() Routine


Where Is swtch() Used?


The Life Cycle of a Process


The Magic of swtch(): a Simple Scenario


A Closer Look at swtch()


Alternative Implementations and Trade-offs 


Adding Multiprocessing to 386BSD 


Reflection: Why is it Hard to Add Multiprogramming After the Fact?


Onward and Forward 


[LISTING ONE] 


[LISTING TWO] 


[LISTING THREE] 


[LISTING FOUR] 


[LISTING FIVE] 


Porting Unix To The 386: The Basic Kernel: Device autoconfiguration 




William Frederick Jolitz and Lynne Greer Jolitz 


Re-examining Our Framework: Kernel Services 


UNIX as the Device Driver Interface

Then What is a Device Driver?

What are Drivers Made of?




Categories of Device Drivers 























































































BSD Autoconfiguration Goals 


BSD Autoconfiguration Approach 


Alternative Autoconfiguration Approaches 


When Autoconfiguration Comes into Play 


Information Required for Autoconfiguration 


How and Where to Find Devices 


Bulk I/O Facility Usage 


Device Characteristics 


Autoconfiguration and Disk Drive Labels 


Higher-level Autoconfiguration 


The ISA in a Nutshell 


I/O Ports 


I/O Display/Buffer/ROM Memory 


Direct Memory Access (DMA) 


386BSD Autoconfiguration Scheme 


386BSD Autoconfiguration Limitations 


Other Buses 


Next Time 


Brief Notes: UNIX—Just a Bag of Drivers? 


Porting Unix To The 386: Device Drivers 


Drivers for the basic kernel 


William Frederick Jolitz and Lynne Greer Jolitz 


Device Drivers Needed for the Basic Kernel 


Contents of the Device Driver: A Dumping Ground? 


Haven’t We Met Somewhere Before?


Display and Keyboard Driver 


Hard Disk Driver 


Clock Driver 


Normal Devices. 


XXintr(). 


Network Devices. 


Next Month 


[LISTING ONE] 


[LISTING TWO] 


Porting Unix To The 386: Device Drivers: Entering, exiting, and masking processor interrupts


Entering, exiting, and masking processor interrupts 


William Frederick Jolitz and Lynne Greer Jolitz 


Example 1: The spl function

386 ISA Interrupt Mechanism in Detail

Interrupt Group Masks

Wiring Interrupts to Drivers

The Interrupt Descriptor Table Entry









































































[LISTING ONE] 


[LISTING TWO] 


[LISTING THREE] 


[LISTING FOUR] 


[LISTING FIVE] 


Porting Unix To The 386: Device Drivers: Getting into and out of interrupt routines 


Getting into and out of interrupt routines 


William Frederick Jolitz and Lynne Greer Jolitz 


UNIX as Interface—Is it Adequate? 


[LISTING ONE] 


[LISTING TWO] 


[LISTING THREE] 


Porting Unix To The 386: Missing Pieces, Part 1 


Completing the 386BSD kernel 


William Frederick Jolitz and Lynne Greer Jolitz 


386BSD Kernel Completion Methodology


386BSD Resource Lists 


Resource Lists Defined 


Program Execution Function 


Executable File Format 


Next Month 


[LISTING ONE] 


[LISTING TWO] 


[LISTING THREE] 


Porting Unix To The 386: Missing Pieces, Part II


Completing the 386BSD kernel 


William Frederick Jolitz and Lynne Greer Jolitz 


Leveraging the POSIX Definition 


Choices of Implementation 


execve()


File Validation 


Executable Format Recognition and Consistency Checking


Reading and Processing Argument Strings 


Building a New Process Image 


Preparing the New Image for Execution 


What’s Not Finished with execve? 


Block I/O Cache 


4.4BSD Demands 


4.4BSD Weaknesses 


Character-by-Character Operations 


386BSD: Other Portions Beyond Basic Operation 


Lessons Learned 


[LISTING ONE] 



























































































































Porting Unix To The 386: A Practical Approach 

In this first installment of a multipart series, the design specification for 386BSD, Berkeley UNIX for the 
80386, is discussed. 

Designing the software specification 

William Frederick Jolitz and Lynne Greer Jolitz 

Prior to leading the 386BSD project, Bill was the founder and CEO of Symmetric Computer Systems, a
BSD-based workstation and networking products manufacturer. He was the principal developer of 2.8 and
2.9 BSD and the chief architect of National Semiconductor’s GENIX project, the first virtual memory
microprocessor-based UNIX system. Prior to establishing TeleMuse, a market research firm, Lynne was
vice president of marketing at Symmetric Computer Systems. She has produced white papers on strategic
topics for the telecommunications, electronics, and power industries. Bill and Lynne conduct seminars on
BSD, ISDN, and TCP/IP, and are in the process of producing a book on 386BSD and a textbook focusing
on the applications layer of the Internet Protocol Suite. They can be contacted via e-mail at
william@berkeley.edu or at uunet!william. Copyright (c) 1990 TeleMuse.

The University of California’s Berkeley Software Distribution (BSD) has been the catalyst for much of the 
innovative work done with the UNIX operating system in both the research and commercial sectors. 
Encompassing over 150 Mbytes (and growing) of cutting-edge operating systems, networking, and 
applications software, BSD is a fully functional and nonproprietary complete operating systems software 
distribution (see Figure 1). In fact, every version of UNIX available from every vendor contains at least
some Berkeley UNIX code, particularly in the areas of filesystems and networking technologies. 

However, unless one could pay the high cost of site licenses and equipment, access to this software was 
simply not within the means of most individual programmers and smaller research groups. 

The 386BSD project was established in the summer of 1989 for the specific purpose of porting BSD to the 
Intel 80386 microprocessor platform so that the tools this software offers can be made available to any 
programmer or research group with a 386 PC. In coordination with the Computer Systems Research 
Group (CSRG) at the University of California at Berkeley, we successfully ported a basic research system
to a common AT class machine (see Figure 2), with the result that approximately 65 percent of all 32-bit
systems could immediately make use of this new definition of UNIX. We have been refining and 
improving this base port ever since. 

By providing the base 386BSD port to CSRG, our hope is to foster new interest in Berkeley UNIX 
technology and to speed its acceptance and use worldwide. We hope to see those interested in this 
technology build on it in both commercial and noncommercial ventures. 

In this and following articles, we will examine the key aspects of software, strategy, and experience that
went into a project of this magnitude. We intend to explore the process of the 386BSD port, while
learning to effectively exploit features of the 386 architecture for use with an advanced operating system. 
We also intend to outline some of the tradeoffs in implementation goals which must be periodically 
reexamined. Finally, we will highlight extensions which remain for future work, perhaps to be done by 
some of you reading this article today. Note that we are assuming familiarity with UNIX, its concepts and 
structures, and the basic functions of the 386, so we will not present exhaustive coverage of these areas. 







In this installment, we discuss the beginning of our project and the initial framework that guided our 
efforts, in particular, the development of the 386BSD specification. Future articles will address specific 
topics of interest and actual nonproprietary code fragments used in 386BSD. Among the future areas to be 
covered are: 

• 386BSD process context switching 

• Executing the first 386BSD process on the PC 

• 386BSD kernel interrupt and exception handling 

• 386BSD INTERNET networking 

• ISA device drivers and system support 

• 386BSD bootstrap process 

Getting Started: References, Equipment, and Software 

Most software ports begin with the naive assumption that the UNIX kernel is merely a C program with a 
handful of functions, supporting other utility C programs on demand. While in essence this is true, in 
practice this is a vast oversimplification. Nevertheless, in the tradition of great projects, we acquired a few 
tools and other items before getting down to work: 

• The Design and Implementation of the 4.3BSD UNIX Operating System by Leffler, McKusick, 
Karels, and Quarterman (Addison-Wesley, 1989) and Programming the 80386 by Crawford and 
Gelsinger (Sybex, 1987) were purchased from a bookstore in Berkeley. Since no one on our team 
possessed any extensive technical background on either the 386 or the IBM PC, the 80386 book was 
our sole resource for the microprocessor. The 4.3BSD book illuminated some of the obscure areas 
and requirements of the BSD UNIX operating systems kernel. We highly recommend these books. 
Both books have become somewhat shopworn during the process; the 80386 book has had its
covers taped twice, primarily due to being thrown repeatedly across the room in the general direction 
of the trash can. This book, while the best resource available on the subject, is not as complete as one 
might hope, primarily because the 80386 is a complex animal and is enigmatic in the correct use of 
its many features. Segmentation exception handling descriptions should not be taken literally, 
although the book was of great value when writing the first versions of exception handling code. 
Some portions of the software were even determined empirically. (Intel was not eager to provide any 
information.) The single biggest problem encountered in our project was that of inadequate 80386 
documentation. 

• A completely blank, inexpensive standard 386 AT clone was the selected hardware platform. To 
minimize expenses and to emphasize commonality, we chose to support only the basic 386 platform. 

• Using exploratory programs written in Borland’s Turbo C, we were able to explore the typical AT 
hardware. These exercises permitted us to better understand the information contained in IBM’s 
Technical Reference Guide Personal Computer AT, a classic, if obscure, work. We then tested the
mechanisms inside the AT to make certain we knew what must be provided in order to generate the 
necessary software driver support for BSD UNIX. 

• Our initial kernel source was the 4.3BSD Tahoe release (available for an obscure machine, the CCI 
Power 6/32, and as similar to the 386 as a can opener), at that time the most stable and recent release. 





All of these references and the equipment were examined prior to generating even the first line of code. 

An understanding of the architectures of the hardware and software is critical to developing an appropriate 
386BSD specification. Thus, we were able to ensure a successful port, even when unanticipated problems 
arose. 

Development of the 386BSD Specification 

Once all the materials were gathered, the temptation was to immediately sit at the PC and write code. This
is a temptation that should always be vigorously avoided. One needs to sit down and carefully break
the project into smaller bites. However, because many parts of the project are interrelated, we must ensure
that the internal standards are uniformly maintained across all areas of the port and during all phases. In other
words, the bridge must meet in the center.

Therefore, instead of plunging directly into development, we began the most critical phase of this (or any) 
port — that of creating the 386BSD specification. This specification addressed the following major issues: 

• Segmentation and paging 

• Virtual and physical address space 

• Process context description 

• System call interfaces 

• ISA device requirements 

• Microprocessor idiosyncrasies 

• Bootstrap 

Unlike a commercial specification, the 386BSD specification was intended to be lightweight and flexible. 
We wanted to focus 386BSD without making the specification a major work in itself. We also knew that 
many of the finer points would change as we got closer to our goal. 

The Definition of the 386BSD Specification 

At first glance, the choice of the 386 microprocessor and ISA system architecture appears to define the
operating system’s machine-dependent requirements. For example, from the original 8088 PC to the present,
MS-DOS has used the software interrupt INT $XX instruction to dispatch through the interrupt vector
table entry XX, and then on to the desired system call inside MS-DOS. This was the only way
application programs could call the operating system.

Had this regularity been true for the UNIX operating system, all 80x86 UNIX systems would be alike, and 
the development of a specification would be a simple task. However, in exploiting the power and 
flexibility of UNIX, one is faced with a grander specification. The kernel architect is now faced with 
competing alternatives. With UNIX, the choices are no longer "cut and dried." 

Adding to this dilemma, the 386 is at least two generations beyond its simple ancestor. The enhanced 
features the 386 now offers allow us many competing ways to satisfy a UNIX system design. Continuing 
our example, instead of using the INT $XX instruction, we can use the intersegment LCALL instruction to 
call the operating system through call gate segments. We can use some powerful features of the 386, but at 
the cost of a more elaborate mechanism. Is it worth it? 





In this case, the LCALL instruction can be used to support backward compatibility with other versions of
UNIX in the form of an applications package rather than within the operating systems kernel, and thus 
may be worth the effort. However, choosing among the myriad, often conflicting, alternatives is typically 
a task fraught with peril. 
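
As a rough illustration of the vector-table dispatch discussed above, the table can be modeled in ordinary C as an array of function pointers indexed by vector number. This is only a sketch under stated assumptions: the names set_vector, soft_int, svc_double, and svc_negate are hypothetical, the handlers are invented stand-ins for operating system services, and nothing here is taken from MS-DOS or 386BSD source. A call gate reached through LCALL would replace the table lookup with a descriptor lookup performed by the processor, which is where the extra mechanism comes in.

```c
#include <stddef.h>

#define NVECTORS 256   /* the 80x86 interrupt table holds 256 entries */

typedef int (*handler_t)(int arg);

/* Hypothetical handlers standing in for operating system services. */
int svc_double(int arg) { return arg * 2; }
int svc_negate(int arg) { return -arg; }

/* The vector table: one handler pointer per vector, all empty at start. */
static handler_t vector_table[NVECTORS];

/* Install a handler for one vector, as an operating system would at startup. */
void set_vector(unsigned vec, handler_t h)
{
    if (vec < NVECTORS)
        vector_table[vec] = h;
}

/* Model of "INT $XX": index the table and call through the entry.
 * Returning -1 for an empty entry is an assumption of this sketch. */
int soft_int(unsigned vec, int arg)
{
    if (vec >= NVECTORS || vector_table[vec] == NULL)
        return -1;
    return vector_table[vec](arg);
}
```

A caller would first install a handler with set_vector(0x21, svc_double) and then invoke it with soft_int(0x21, 21); the application never needs to know where the handler lives, only its vector number.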

For the 386BSD project, we first determined our priorities:

• 100 percent BSD kernel and user functionality. The system must contain all important underlying
mechanisms of the Berkeley UNIX system. Any extensive modification to how Berkeley UNIX
functions on other extant platforms can result in incompatibility. Incompatibility is like an
irritating insect that bites in many places, and it tends to lie hidden until after extensive
distribution. As such, we did not exploit some features of the 386, such as its elaborate segmented
architecture, at the expense of compatibility.

• Efficient use of the native processor architecture. We would like to use the system in ways that
obtain the highest performance and greatest functionality possible.

• Interoperability with existing commercial standards. We would like to use the system in ways that
maintain compliance with extant commercial standards. We do not intend to create arbitrary new
standards unnecessarily if current standards are acceptable.

• Rapid implementation of the basic operating system. One maxim of any UNIX development effort is
"the best tool to build a UNIX system IS a UNIX system." We needed to bootstrap ourselves rapidly
into operation and leverage 386BSD itself to complete the project.

Conflicts in Priorities 

These basic priorities inherently conflict. For example, BSD systems have basic incompatibilities with the
AT&T System V UNIX systems, because each project has firm interests and no compelling need to
cooperate. As such, perfect compatibility is impossible to achieve given our project focus. The opposite
tack, no compatibility constraints, is also not completely acceptable, because we are dealing with the PC
class of computers and not minicomputers or workstations. Fine-grained differences also exist among the
many standards currently competing for favor in the world of 386 UNIX systems.

386BSD Port Goals: A Practical Approach 

Given all of these trade-offs, we decided to take what we call a "practical" approach to 386BSD. We 
concentrated primarily on "hard adherence" to both BSD operability and high-performance 
implementation, for the simple reason that 386BSD is a research project intended for use by the research 
community. However, because even this audience depends on commercial resources, we decided to invest 
some of our effort in the development of a few fundamental areas such as System Call Interface 
Definition. 

By dealing with these basic areas, we allowed for limited adherence to commercial standards from the 
start, with the ability to gradually extend 386BSD as needed. (For example, in future releases we hope to 
offer some degree of support for segmentation and VM8086 mode.) We have also tried, when possible, to 
conform to the spirit of the 386 Application Binary Interface (ABI) and its predecessor Binary Compatible 
Standard (BCS) when they did not conflict with our adherence to Berkeley UNIX. 





Some may take issue with this stance, seeing binary compatibility standards entirely as an "all or nothing" 
issue. Those who spend a great deal of time arguing over the big end and the little end of the ABI egg are 
usually involved in maintaining control over the shrink-wrap commercial software market. However, 
those who wish to ignore the ABI juggernaut are also ignoring the largest body of UNIX software outside 
the research community. In this case, ignorance is simply a mask for arrogance. As we stated earlier, we 
have tried to take a "practical" approach that builds in the flexibility without altering the scope of our 
project. 

Many people wonder why UNIX systems are so big and complex. A look through any UNIX kernel can
quickly answer this question. Many different groups prefer to further their standards agenda by claiming a piece
of the kernel for their own use, instead of redesigning it for common support or moving things out of it
that really belong in an application process. SVR4 alone is rumored to contain 14 different filesystems
that are just variations on a theme. This "Chinese menu" approach to kernel design has resulted in a
bloated kernel that is difficult to enhance or maintain. Because standards by accumulation just don't work,
with 386BSD we strive to avoid such nonsense.

Another goal of our project was to ensure that all code developed for the 386-specific portions of this
project be unique and novel. This is to prevent any particular commercial agent from arbitrarily 
appropriating, monopolizing, or prohibiting discussion and distribution of this code. This is the major 
reason why we are able to examine some of the interesting mechanisms of 386BSD without the censorious 
effect of proprietary license agreements. 

Microprocessor and System Specification Issues 

Our specification required that we break it down into two basic technical areas: the microprocessor itself 
and the surrounding system hardware. In keeping with our goals, we segregated the two in order to allow 
future support for other buses (such as EISA and Micro Channel) and to avoid obscuring microprocessor 
issues. 


The microprocessor required much delineation in the areas of segment and paging strategies, virtual
memory allocation and other memory management issues, communications primitives, context switching,
faults, and the system call interface. We also had to factor in microprocessor idiosyncrasies and bugs as
we went along. On the system side, we concentrated on ISA bus considerations.

We first outline some of the major issues revolving around the 386 microprocessor itself and how they 
relate to a Berkeley UNIX port. 


386 Memory Management Vitals 


Most popular microprocessors use either segmentation or paging to manage memory address space access.
The 386 is rare in that it possesses both. In fact, since segmentation (Figure 3(a)) is placed on top of
paging (Figure 3(b)), you are expected to use segmentation in some form any time memory is paged.
And, most important, BSD relies on paging.






All operand references on the 386 are tied to one of the segment registers. Each segment register holds a
16-bit selector (whose low-order bits determine the level of access) that is used to find a descriptor. This descriptor then
determines the location of underlying memory in linear address space. When segmentation alone is
enabled (also known as protected mode), the linear address corresponds to the physical address of
the selected segment for the operand. However, when paging is implemented, the linear
address must be run through a two-level paging mechanism to find the physical page frame number, the
actual address of physical memory underneath the virtual address.
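
The translation steps described above can be sketched as a pair of bit-field decoders. The field positions follow the standard 386 layout (selector: privilege level in bits 0-1, table indicator in bit 2, descriptor index in bits 3-15; linear address: 10-bit directory index, 10-bit table index, 12-bit page offset). The struct and function names are our own for illustration and do not come from 386BSD.

```c
#include <stdint.h>

/* Fields of a 16-bit segment selector. */
struct selector {
    unsigned rpl;    /* bits 0-1: requested privilege level */
    unsigned ti;     /* bit 2: 0 = global table, 1 = local table */
    unsigned index;  /* bits 3-15: descriptor index within the table */
};

struct selector decode_selector(uint16_t sel)
{
    struct selector s;
    s.rpl = sel & 0x3;
    s.ti = (sel >> 2) & 0x1;
    s.index = sel >> 3;
    return s;
}

/* Fields of a 32-bit linear address under two-level paging (4 KB pages). */
struct linear {
    unsigned dir;     /* bits 22-31: page directory index */
    unsigned table;   /* bits 12-21: page table index */
    unsigned offset;  /* bits 0-11: byte offset within the page */
};

struct linear decode_linear(uint32_t la)
{
    struct linear l;
    l.dir = la >> 22;
    l.table = (la >> 12) & 0x3ff;
    l.offset = la & 0xfff;
    return l;
}

/* Combine the page frame number found in a page table entry with the
 * page offset to form the physical address. */
uint32_t physical_address(uint32_t frame_number, uint32_t la)
{
    return (frame_number << 12) | (la & 0xfff);
}
```

The hardware performs the two table walks itself; the decoders only show which bits select the directory entry, which select the page table entry, and which pass through untranslated.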

One of the most powerful, yet confusing, features of the 386 is its segmented architecture. While the 
current trend in microprocessors has been oriented towards a single "flat" linear virtual address space, the 
386 has continued the bias toward segments held by the entire 80x86 line. The two most important 
changes in the 386 from previous versions — permitting 32-bit operations and expanding segments from 
64 Kbytes to 4 gigabytes in size — may turn some of the inherent disadvantages of 80x86 segments into an 
advantage. Segments once too small for many data items (such as arrays of real numbers) can now utilize 
alternative address spaces. This is of great interest to those working with specialized applications, such as 
3-D to 2-D transformations. 

Segmentation and 386BSD 

UNIX was initially developed on machines that relied on linear virtual address spaces. As such, Berkeley 
UNIX provides no support for segments and instead expects a large linear virtual address space for both 
kernel and user. In fact, UNIX in general adapts to segments only under duress. 

Originally, we had intended to use segments in a straightforward manner. However, we found that this would
result in a host of nuisance problems. For example, many programs (debuggers, assemblers, and
object-linking editors) would have to be modified so that separate address spaces for the various regions could be
maintained. The object file format, always in a state of flux due to the varying degrees of dynamic loading of
instruction and data structures, would also require change.

Another problem which arises when using segments is that the shared data in the instruction segment 
requires strict typing in the assembler (we force instructions to reference the CS segment directly) to 
obtain access. Because some compilers put data constants in the code area with the intent of sharing 
memory used by other processes, invoking segments would create little problems everywhere for the 
compiler. 

Still other problems result from the use of string instructions on stack-resident data and that time-honored
bad practice known as self-modifying code. The key flaw in all these cases is that the binding to the
particular segment register is mandated by the assembler, and cannot be properly resolved by the object
code linker as other symbols are normally handled.

Given all of the problems that arose, and in accordance with our 386BSD goals, we chose to minimize
support for segmentation by running the machine in "flat" mode. As a result, no tinkering with object file 
format or tools was required. An amusing side effect of this approach is that it allowed us to cross-develop 
386 code on VAX and NSC32000-based computers using the native object utilities. This choice 
minimized bookkeeping considerably but also ultimately defeated the purpose of segments. A more 
elaborate design was beyond the scope of our project. 





Kernel Linear Address Space Overhead 


The kernel, as well as the user mode programs, requires its own set of segment registers. If the kernel is
called, its segments must be present. This takes up precious linear address space. Thus, we can never run a
process exactly 4 gigabytes in size because a portion of the address space must be reserved for kernel use.
Even if we try to use segments to relocate the kernel, we cannot escape the limit: the kernel not only takes up the
same linear address space but also forces us to use intersegment instructions to communicate data
between the user process and the kernel. Since the user process and the kernel must share virtual address
space, we limited ourselves to a maximum process size of 4 gigabytes less the kernel size.


The kernel segment registers are outlined in Figure 4. These segment registers cover (alias) the user
segments and allow access to the user space from the kernel in any way desired (read, write, or execute). 
Because all segments start at zero, the kernel begins at a high address (or offset) and always runs 
relocated. In 386BSD, the code segment just covers the end of the kernel instruction region, because no 
self-modifying code was needed. 
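
The arithmetic behind "4 gigabytes less the kernel size" can be written out as follows. The kernel base address used here, 0xFE000000, is purely a hypothetical value chosen for illustration; the text above does not state where the 386BSD kernel begins, only that it sits at a high address.

```c
#include <stdint.h>

#define ADDRESS_SPACE  (1ULL << 32)    /* 4 gigabytes of linear address space */
#define KERNEL_BASE    0xFE000000ULL   /* hypothetical start of the kernel region */

/* Bytes of linear address space set aside for the kernel. */
uint64_t kernel_reserved(void)
{
    return ADDRESS_SPACE - KERNEL_BASE;
}

/* Largest possible user process: everything below the kernel. */
uint64_t max_process_size(void)
{
    return ADDRESS_SPACE - kernel_reserved();
}

/* With all segments based at zero, a user address is simply one that
 * falls below the start of the kernel region. */
int is_user_address(uint32_t va)
{
    return va < KERNEL_BASE;
}
```

Under this assumed base the kernel would reserve 32 megabytes, leaving the rest of the 4 gigabytes to the user process.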


One way to avoid linear address space sharing constraints is to have all interrupts, traps, exceptions, and 
system calls internally context switch to a separate process to execute UNIX system functions, using the 
386 trap with task switching feature. This unique 386 hardware allows traps to be handled by either 
procedures or tasks. However, task switching is very expensive and the system would context switch 
thousands of times more frequently than otherwise. Also, the UNIX kernel is not intended to run itself as a 
process, as use of this feature would require. 


Virtual Address Space Layout 

Within the 4-gigabyte per-process address space, a process must be allocated regions for instruction, data, 
and stack for both user programs and the kernel. Some of these regions (user data, user stack) must grow 
as a process runs, and support must be available for additional regions used for shared memory and 
shared/dynamically loaded libraries. The size of these regions and their placement becomes an important 
consideration for any UNIX port. 

The traditional UNIX approach is to place the instruction region at the beginning of the address space, 
followed by data, unused space, and finally a stack region. The purpose of the empty space is to build in 
room so that the stack can grow down and the data (for heap storage) can grow up. The end of the data 
region is known in UNIX vernacular as the "break." Usually, text starts at absolute virtual address 0. 

A problem common with UNIX systems arose from the extensive use of uninitialized string pointers, 
which by default were set to the value 0. Because the first word at address 0 was also set to 0, this meant 
that null pointers always pointed to null strings. However, many early computers did not permit the 
bottom of address space to be used in this way and a tested program would abort. UNIX code that was 
thought "proven" on the PDP-11 and VAX was actually masked by the development system architecture. 
Eventually, many uninitialized pointers were located and corrected. Some versions of UNIX also leave the 
very bottom and top of address space unmapped to catch indirections through 0 and -1. This method is of 
limited effectiveness, however, if a structure referenced through such a pointer is bigger than the size of 
the bottom and top address space holes. 






386BSD virtual address space is arranged in the traditional manner (see Figure 5). The user address space 
begins at zero with text (yes, we do indeed have 0 at location 0), followed by data, unused space, and 
finally the stack. The start of the user stack, located at the top of the user’s address space, is not fixed. (A 
future project may utilize this feature to "lower" the stack, providing room for dynamically created 
regions.) Because only the operating system needs to know the exact location of the user stack, it assigns 
the stack's address space on process program load (exec system call). 

Per-Process Data Structures 

The kernel address space resides above the user portion of the process virtual address space. By virtue of 
being co-resident in the virtual address space with the user space (a somewhat mandatory virtue), the 
kernel can directly reference any part of the current running user process in the lower portion of memory. 

As in the user space (and in UNIX executable files), kernel instructions and data are arranged 
consecutively. The stack and a new special region, the per-process data structure or user structure (u. for 
short), appear below the kernel. One advantage of this arrangement is that it becomes possible to share all 
portions of the page tables for address space above the kernel base address. Notice that though this is a 
vital part of the kernel, it is technically at the very top of user address space and is purposely left readable 
by the user process. Everything beneath the system base address is switched when a context switch to the 
next process occurs. 

Currently, the kernel address space starts at virtual address 0xfe000000, and allows up to 32 Mbyte of 
address space to be reserved for use within the kernel. This boundary can be moved at a later date if more 
address space is needed. 
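
The arithmetic behind this boundary can be checked with a few lines of C. This is only a sketch: the constant and function names are invented for illustration and are not taken from the 386BSD sources; the base address 0xfe000000 and the 4-gigabyte address space are the only values drawn from the text.

```c
#include <stdint.h>

/* Illustrative names, not taken from the 386BSD sources. */
#define KERNEL_BASE   0xfe000000u   /* start of kernel address space (from the text) */
#define ADDRESS_SPACE (1ull << 32)  /* 4 gigabytes of address space per process */

/* Bytes of address space reserved for the kernel, from KERNEL_BASE to the top. */
static uint64_t kernel_reserved_bytes(void)
{
    return ADDRESS_SPACE - KERNEL_BASE;
}

/* Largest possible user portion: everything below the kernel base. */
static uint64_t max_user_bytes(void)
{
    return KERNEL_BASE;
}
```

With these values the kernel reservation works out to 0x02000000 bytes (32 Mbyte), leaving 4064 Mbyte below the boundary for the user portion of a process.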

Access of the ISA bus device memory (screen and LAN buffers) is obtained through an allocated region of 
the kernel memory, known as a utility page map. This is similar to portions of on-demand physical 
memory used by the kernel through other utility page maps. The kernel also has a variety of data structures 
scaled and allocated at boot time (valloc) and a heap for dynamic demands (malloc). 

386 Virtual Memory Address Translation Mechanism 

The 386 paging mechanism impacts the 386BSD specification with respect to address space allocation 
constants: Each page is 4 Kbyte in size and must reflect the minimum granularity of address space 
allocation, while each page of page tables maps 4 Mbyte of address space. These constants determine 
address boundaries used to allocate memory and share address space between similar processes. Shared 
objects starting on 4-Mbyte boundaries can share page tables as well as underlying physical memory. 

Page size granularity is important to the layout of executable files. Instruction and data regions are 
arranged into discrete and aligned memory page units, so that it is possible to demand load pages that may 
be either "read-only" (instructions) or "read-write" (data or stack). Boundaries at page table granularity 
(4 Mbyte) typically fall at the beginning of the user, user stack, and kernel address spaces. It is possible 
to share the page tables for such regions among many processes, obviating the need for separate page tables. As a result, while each process 
has its own page table directory, the top eight PDEs of each process page table directory point to the same 
kernel page tables. Thus, the kernel’s portion of address space is global to all other processes. 
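
One way to picture this sharing is as a copy of the top eight directory entries from a master directory into each new process directory. The sketch below is an assumption-laden illustration, not 386BSD code: `pde_t`, `share_kernel_pdes`, and the constant names are hypothetical, and a real kernel would also have to fill in the user entries.

```c
#include <stdint.h>
#include <string.h>

#define NPDE       1024u            /* entries in one page directory */
#define NKPDE      8u               /* top eight entries map the kernel (from the text) */
#define FIRST_KPDE (NPDE - NKPDE)   /* index of the first kernel entry */

typedef uint32_t pde_t;             /* hypothetical name for a page directory entry */

/* Copy the kernel's directory entries into a new process directory, so that
   both directories point at the same kernel page tables. */
static void share_kernel_pdes(pde_t *new_dir, const pde_t *kernel_dir)
{
    memcpy(&new_dir[FIRST_KPDE], &kernel_dir[FIRST_KPDE],
           NKPDE * sizeof(pde_t));
}
```

Because every directory then holds identical entries for indices 1016 through 1023, a change made in a kernel page table is visible to all processes without touching their directories.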






User to Kernel Communication Primitives 


By arranging our address space as outlined, we’ve greatly simplified the routines that communicate 
between kernel and user process (now the kernel routines can directly access user space). All that is 
needed is a way to determine if a selected portion of user memory may be read or written before it is 
attempted. On some machines (such as the VAX) special instructions are available for this purpose. The 
386, however, offers instructions only for use in validating segments, not pages. So we must use a 
different strategy. 

In 386BSD, we chose to set a global variable (nofault) to a nonzero value. If a fault happens during any 
user/kernel communication primitive, it transfers to the address held within nofault. In this way we can 
catch illegal references by using the microprocessor’s own address translation mechanism to find them, 
instead of by tedious code evaluation on every reference. 
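
The control flow of this technique can be modeled in ordinary user-level C. The sketch below is a simulation under stated assumptions, not the 386BSD implementation: a `jmp_buf` pointer stands in for the recovery address held in nofault, `user_load` stands in for a hardware-checked memory reference, and all function names are hypothetical.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Simulated user address space: only addresses below USER_LIMIT are mapped. */
#define USER_LIMIT 4096u
static unsigned char user_memory[USER_LIMIT];

/* Recovery point for faults; NULL means no copy primitive is in progress. */
static jmp_buf *nofault;

/* Stand-in for the page fault handler in the kernel. */
static void simulated_page_fault(void)
{
    if (nofault != NULL) {
        jmp_buf *recover = nofault;
        nofault = NULL;
        longjmp(*recover, 1);   /* resume at the recovery point */
    }
    abort();                    /* a real kernel would panic or signal here */
}

/* Stand-in for a load that the address translation hardware checks. */
static unsigned char user_load(size_t uaddr)
{
    if (uaddr >= USER_LIMIT)
        simulated_page_fault();
    return user_memory[uaddr];
}

/* Copy len bytes from user address uaddr into kbuf.
   Returns 0 on success, -1 if any reference faulted. */
static int copyin_sketch(size_t uaddr, void *kbuf, size_t len)
{
    jmp_buf recover;
    unsigned char *dst = kbuf;
    size_t i;

    if (setjmp(recover) != 0)
        return -1;              /* reached through the fault path */
    nofault = &recover;
    for (i = 0; i < len; i++)
        dst[i] = user_load(uaddr + i);
    nofault = NULL;
    return 0;
}
```

The copy loop itself contains no validity checks; an illegal reference is discovered only when the (simulated) translation fails, which is the point of the approach.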

Unfortunately, one idiosyncrasy of the 386 now rears its ugly head. The designers of the 386 decided that 
segment attributes should be used to ultimately determine access to regions in a process, thus making their 
use mandatory in the system even if we don't need them. To be precise, we have page attribute bits that 
can be used for protection. These work as expected, unless the 386 is run in supervisor mode (as does the 
kernel). In this case, only the valid/invalid attribute has any effect. This nuisance or "feature" requires a bit 
of workaround to make the primitives complete. 

Berkeley UNIX Virtual Memory System Strategy 

The current Berkeley UNIX virtual memory management subsystem was originally designed for use with 
a VAX, and as such has no support for page directories. For that matter, the 386 doesn’t know of such 
VAX concepts as P0 and P1 address spaces for instruction/data and stack nor of page table-length 
registers. Currently, these are simulated in 386BSD. However, work is underway to revise the entire 
virtual memory system to permit more generalized operation over all supported Berkeley UNIX platforms, 
now that the demands of each platform have been made obvious. 

Portions of the VAX were simulated by employing code, written by Mike Hibler at the University of Utah, 
which supports the 68030 paging memory management. Because the 386 code is so similar, we used a 
conditional compilation that shares 68030 and 386 versions interchangeably — an odd couple indeed. 

Structure of Per-Process Data (u.) 

Within each process accessed by the kernel exists a unique data structure containing the private variables 
of the process used to provide UNIX system call functionality. This is called the "extended state" of a 
given process and is collected into one location. If the process is long inactive, this state is swapped to 
secondary storage to reclaim RAM memory. All of the machine-dependent fields in this structure lie 
within the first element u_pcb, a process context descriptor. However, the size of this structure and its 
adjoining kernel stack is also a machine-dependent parameter. The u. is currently defined at about 1 Kbyte 
in size. This fits amply within a single page. 

Another page is sufficient to hold a kernel stack. This results in a per-process data structure two pages in 
size. By leaving these as two separate pages in 386BSD, instead of combining them into a single page 
(giving us a smaller kernel stack), the kernel stack segment can be used to catch the stack overflow 
("redstack") condition. This will appear as a future enhancement. 





Process Context Description 

As seen in Figure 6, the process control block (struct pcb) contains the 386-specific per-process 
information. This is broken down into hardware-dependent fields and software-related fields. The process 
control block is placed at the front of the user structure so that the information can be reloaded from the 
address of the user structure and force active a previously inactive process. The user structure address is 
recorded in the process table. Each entry describes global information about a process. 

The 386’s hardware context switch facility can be used to switch from process to process. By placing the 
hardware-dependent information at the beginning of the process control block, in the form of the 386’s 
Task State Segment (TSS) data structure, it is possible to switch from one process to another with a single 
intersegment ljmp instruction to the appropriate task gate selector. While this feature has been 
implemented in 386BSD, it is not used at this time for switching between processes due to performance 
considerations. However, it can be used in other cases, such as exception handling, and we may elect to 
use it for process switching in the future. We view this as one of those rare "have your cake and eat it too" 
decisions. 

In 386BSD, not all hardware context is switched in this manner, because some processes never access the 
large amount of state information (108 bytes) used by the numeric coprocessor. We allow for this with the 
pcb_fpusav structure. Other fields correspond to some implementation demands specific to Berkeley 
UNIX, including simulating VAX hardware constructs invoked by the virtual memory system not existing 
on the 386. Fortunately, this was a small amount of code. It is a tribute to the concept of UNIX that the 
machine-dependent portion of the system is as small as it is. 

Page Fault and Segmentation Fault Mechanism 

To report exceptions that occur in the 386 memory management hardware, they must be caught and routed 
to the proper portion of the kernel. UNIX places these exceptions in two categories: Faults signaled to the 
user process, which terminates the process if it is not interested in the exception, and "resource not 
present" faults sent to the virtual memory system to request a missing page. 

The 386 also signals a variety of segment exceptions, almost all of which result in dire consequences for 
the process that invokes them. A single page fault exception encodes both "page not present" as well as 
"protection violation" events. The fault address is recorded in processor special register cr2; it should be 
carefully examined, together with the error code pushed by the exception, to determine the precise nature of each fault. 
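
As a rough illustration of how the two kinds of page fault can be told apart, the sketch below decodes the error code that the 386 pushes for a page fault. The bit assignments follow Intel's documentation for the processor rather than anything stated in this article, and the macro and function names are invented for the example.

```c
#include <stdint.h>

/* Page fault error code bits as documented for the 386 (names are illustrative). */
#define PGEX_P 0x1u   /* 0: page not present, 1: protection violation */
#define PGEX_W 0x2u   /* 0: read access,      1: write access */
#define PGEX_U 0x4u   /* 0: supervisor mode,  1: user mode */

enum fault_kind { FAULT_NOT_PRESENT, FAULT_PROTECTION };

/* Classify the fault as a missing page or a protection violation. */
static enum fault_kind page_fault_kind(uint32_t error_code)
{
    return (error_code & PGEX_P) ? FAULT_PROTECTION : FAULT_NOT_PRESENT;
}

/* Nonzero if the faulting access was a write. */
static int page_fault_was_write(uint32_t error_code)
{
    return (error_code & PGEX_W) != 0;
}

/* Nonzero if the fault was taken while executing in user mode. */
static int page_fault_from_user(uint32_t error_code)
{
    return (error_code & PGEX_U) != 0;
}
```

A handler built this way would pass "not present" faults to the virtual memory system and treat protection violations as candidates for a signal to the process.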

Other Processor Faults 

Along with address space faults, we found we must map 15 other faults (see Figure 7) into the Berkeley 
UNIX kernel exception-handling mechanisms. The numeric coprocessor presents special fault-handling 
challenges, for it can be operating when 386BSD switches to another unrelated process. In that case, we 
can get a trap that should have been passed to a process other than the one currently running. 






Figure 7: 386 processor exceptions that needed to be mapped into the kernel 
exception-handling mechanism 

386 Processor Exceptions 

Exception            Description                                         Pushes an Error Code? 
-------------------  --------------------------------------------------  --------------------- 
Divide               Division by 0 or division overflow                  No 
Debug/Trace          Single step or debug hardware condition             No 
Breakpoint           Executed an INT3 instruction                        No 
Overflow             Executed an INTO instruction when OF bit set        No 
Bounds Check         Executed a BOUND instruction which failed           No 
Illegal Instruction  Executed an unknown instruction                     No 
NPX DNA              Numeric processor device not available              No 
Double Fault         Recursive fault (fault while processing a fault)    Yes 
NPX Operand          Numeric processor accessed outside of segment       No 
Invalid TSS          Attempted to task switch to incorrect task state    Yes 
Segment Not Present  Attempt to access a not present descriptor          Yes 
Stack Segment        Problem with current stack descriptor               Yes 
General Protection   Protection problem with a segment descriptor        Yes 
Page                 Page missing or protection problem with address     Yes 
NPX Error            Numeric processor signals an error                  No 


If 386BSD receives an unexpected fault while running in the kernel, it must immediately force the kernel 
down (in UNIX vernacular, to "panic") and attempt to save as much state information as possible for 
diagnostic purposes. Thus, we differentiated user traps from kernel traps. In most other microprocessors, a 
bit in the processor flags or status word determines if we are running in the kernel, but the 386 offers no 
such bit. So, 386BSD examines the contents of the CS segment register when a trap occurs (this is saved 
by the hardware during an exception) to determine if an instruction was executing in user mode. 
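
A minimal sketch of such a test follows. It assumes the usual arrangement in which user code runs at privilege level 3 and the kernel at level 0, so that the low two bits of the saved CS selector identify the mode; the names are hypothetical and the selector values used to exercise it are made up.

```c
#include <stdint.h>

#define SEL_RPL_MASK 0x3u   /* low two bits of a selector hold the privilege level */
#define SEL_UPL      0x3u   /* privilege level 3: user mode (assumed) */

/* Nonzero if the CS value saved by the trap belongs to user mode. */
static int trap_from_user(uint16_t saved_cs)
{
    return (saved_cs & SEL_RPL_MASK) == SEL_UPL;
}
```

A trap handler could then treat an unexpected fault as a panic when `trap_from_user` returns zero and as a signal to the process otherwise.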

Microprocessor Idiosyncrasies 

We found a hornet’s nest of microprocessor idiosyncrasies unique to a 386 UNIX port. Some of the 
primary issues these touched upon included that of switching from real mode (20-bit addressing) to 
protected mode (32-bit addressing), creating segment descriptors to fill the interrupt descriptor table, 
creating other segments for use by the user and kernel modes of a process, and finally, novel surprises 
between different steppings of the 386/486 themselves. 







One major irritant was the need for at least one TSS structure to be present at any time, even if we didn't 
use a TSS for task switching. The TSS records the contents of the kernel’s stack pointer for use when the 
kernel is reentered from user mode (interrupt, exception, and system call). Our early versions of 386BSD 
worked well as it started up within the kernel, moved into user mode for the first process, and then froze 
after hitting the first system call. Imagine our surprise when we found that, in effect, it had no place to 
save where it was coming from on the kernel stack! 

System Call Interface 

A table of system calls is provided by Berkeley UNIX with the assigned index number that differentiates 
them. This table specifies, in part, a binary standard for system calls — in this case, of a POSIX-based 
system. Of course, because POSIX is considered an "object library" definition (as opposed to the 
regulation at the system call level desired by ABI and BCS advocates), one might accurately consider this 
an "academic" standard. In deference to these other standards, however, we chose to accept their suggested 
format for system calls. 

Figure 8 is a code template for the system call stub used in 386BSD, in this case a write system call. The 
lcall instruction is an intersegmental call instruction that references a special segment selector, known to 
be a UNIX system call gate into the kernel. The selector corresponds to the first descriptor in the process's 
local descriptor table. To designate which system call is to be used, the eax register is loaded with the 
index from the table. Arguments for each system call are present on the stack, and this stub is called from 
another procedure. System calls return after the lcall instruction, returning values in the eax and edx 
registers (just as other C procedures do). System calls report failure by setting the carry bit and recording 
error notification in eax. 

System Specific (ISA) Issues 

So far, we have only described issues relating to our choice of microprocessor. But this specification is 
incomplete unless the issues relating to the bus and the system surrounding the microprocessor are 
examined. We recognized that the 386 already operates on a plethora of different buses, including ISA, 
EISA, MCA, VME, and MULTIBUS, and that these issues vary depending on which bus is used. We may 
even need to support more than one bus at a time, or even a custom bus. As such, we decided that 386BSD 
must take into consideration the support requirements of many different bus standards. 

Physical Memory Map 

The ISA bus physical memory layout is outlined in Figure 9. The memory is broken into three parts: base 
memory, I/O device memory, and extended memory. RAM is split up under this standard, with a base 
memory section, holding up to 640 Kbyte of memory, starting at address 0 and ending at the beginning of 
device memory. Remaining memory is located starting at address 0x100000 (above 1 Mbyte) and 
extending to as much as 0xFFFFFF (16 Mbytes). 

Between the base and extended RAM regions lies device memory, where display adapter cards and LAN 
cards use special RAM buffers. This region, called the "hole," is a nuisance for UNIX ports, because we 
would rather see contiguous memory. Although we do have a means of making memory appear 
contiguous through the use of virtual memory, this does us no good when we must work with physical 
memory addresses during system bootstrap, hardware DMA devices, and physical memory allocation 
structures. 
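
The three-part layout can be expressed as a small classification routine, which is the kind of check a physical memory allocator has to make to step around the hole. This is a sketch only: the boundary values come from the description above, while the enum and function names are invented for illustration.

```c
#include <stdint.h>

#define BASE_MEM_END  0x0A0000u    /* 640 Kbyte: end of base memory */
#define EXT_MEM_START 0x100000u    /* 1 Mbyte: start of extended memory */
#define ISA_MEM_END   0x1000000u   /* 16 Mbyte: one past 0xFFFFFF */

enum isa_region { ISA_BASE, ISA_DEVICE_HOLE, ISA_EXTENDED, ISA_OUT_OF_RANGE };

/* Report which part of the ISA physical memory map an address falls in. */
static enum isa_region isa_classify(uint32_t paddr)
{
    if (paddr < BASE_MEM_END)
        return ISA_BASE;
    if (paddr < EXT_MEM_START)
        return ISA_DEVICE_HOLE;
    if (paddr < ISA_MEM_END)
        return ISA_EXTENDED;
    return ISA_OUT_OF_RANGE;
}
```

For example, a display buffer address such as 0xB8000 falls in the device hole, while 0x100000 is the first byte of extended memory.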


If extended memory is not available, we must temporarily reside in the MS-DOS 640-Kbyte base-memory 
dungeon. This is truly hell for memory-consumptive UNIX systems. Fortunately, this occurs only when 
the system is "misconfigured" during the configuration or boot processes, and is not a "normal" situation. 

ISA Device Controllers 

To support common ISA devices, 386BSD must cope with a separate I/O address bus, shared memory, 
vectored interrupts, and dedicated DMA controllers. Since most of these evolved from ad hoc standards, 
device conflicts are common. In order to accurately support ISA, we began with a minimal AT 386 
configuration — 386/387, 1-Mbyte RAM, keyboard, monitor, Winchester drive (ST506, ESDI, IDE), and 
floppy drive — and relied solely on what the BIOS uses to work the hardware. We expect an improvement 
in performance when these guidelines are eventually relaxed. 

ISA Device Auto Configuration 

A key advantage of Berkeley UNIX is its ability to configure at boot time devices present on the system. 
This feature, while difficult to implement on the ISA given numerous conflicts, was considered valuable 
and was implemented. 

In Figure 10(a), we have data structures that encode all the appropriate information to configure a device 
in 386BSD. Each driver, which may have many devices, is able to locate and configure a device if present. 
The isa_device structure also contains the characteristics of each device to be recognized. If found, 
hardware resources can then be assigned to each device as configured. A sample table of possible devices 
to search for within the kernel appears in Figure 10(b). 

Interrupt Priority Level Management 

In the PC architecture, there is a separate interrupt level per device interrupt. These are more levels than 
traditional UNIX wants or needs. Instead, UNIX groups different classes of devices into interrupt priority 
levels that can be disabled and enabled as a group (disks, terminals, network). This is done through spl() 
function calls, named for a PDP-11/45 instruction which implemented this feature on early UNIX systems. 
This capability must be provided in 386BSD as well. 

Each interrupt vector (interrupt gate) has code that saves the cpl (current priority level) variable on the 
stack, sets the new cpl value, and turns on interrupts above this level. On return from the interrupt, all 
vectors call a common routine that disables interrupts, restores the cpl, and returns with interrupts enabled. 
The cpl is altered, as is the priority mask of the dual 8259 ICUs, by the spl() subroutines. The 
microprocessor or system can now be run at different priority levels on demand. 
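
The bookkeeping involved can be modeled in a few lines of C. The sketch below is a simplified model and not the 386BSD code: the priority classes, the IRQ assignments in `ipl_mask`, and the function names are all made up for illustration, and a plain variable stands in for the mask registers of the two 8259s.

```c
#include <stdint.h>

/* Hypothetical priority classes, lowest to highest. */
enum { IPL_NONE, IPL_TTY, IPL_BIO, IPL_NET, IPL_HIGH, NIPL };

/* IRQ lines blocked at each level (bit n = IRQ n); the assignments are made up. */
static const uint16_t ipl_mask[NIPL] = {
    0x0000,   /* IPL_NONE: nothing blocked */
    0x0018,   /* IPL_TTY:  IRQ 3 and 4 */
    0x4058,   /* IPL_BIO:  adds IRQ 6 and 14 */
    0x4258,   /* IPL_NET:  adds IRQ 9 */
    0xffff    /* IPL_HIGH: everything blocked */
};

static int      cpl = IPL_NONE;   /* current priority level */
static uint16_t icu_mask;         /* stand-in for the 8259 mask registers */

/* Set the level and the matching interrupt mask. */
static void spl_set(int level)
{
    cpl = level;
    icu_mask = ipl_mask[level];   /* real code would write the ICU ports here */
}

/* Raise the priority to at least `level`; returns the previous level. */
static int spl_raise(int level)
{
    int old = cpl;
    if (level > old)
        spl_set(level);
    return old;
}

/* Restore a level previously returned by spl_raise. */
static void spl_restore(int old)
{
    spl_set(old);
}
```

A driver would bracket a critical section with `s = spl_raise(IPL_BIO); ... spl_restore(s);`, which blocks every device in that class and below without having to name individual interrupt lines.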

Bootstrap Operation 

One of the last considerations in the development of the 386BSD specification is deciding how we can 
most easily bootstrap load the BSD kernel from hard or floppy disk. We know that ISA machines have 
BIOS ROMs that select the device to be booted (typically the floppy first, followed by the hard disk), load 
the very first block into RAM at location 0x7c00, and finally execute it in real mode. From this point on, 
we had to create some tight code to run within that 512-byte block to read in our kernel from an 
executable file in the UNIX file system. 


Traditional Berkeley UNIX undergoes a four-step bootstrap process to load in the kernel. First, the initial 
block bootstrap is brought in from disk by the hardware (in this case, the BIOS). The primary purpose of 
this assembly language bootstrap is to load in the second 7.5-Kbyte bootstrap located immediately after 
the initial boot on disk. This larger program, written in C, is much more elaborate in that it can decipher 
the UNIX file system, extract the UNIX file /boot, and load it as the next stage in the bootstrap. /boot, the 
most complex of the three bootstraps, evaluates the boot event and finally passes configuration parameters 
to the kernel as it is loading /vmunix, also located in the file system. 

At first we intended to write the initial block bootstrap in MASM, Microsoft’s MS-DOS assembler, and 
use calls to the BIOS to accomplish the boot process. This proved to be unsatisfactory, as it still left us tied 
to MS-DOS. So, we decided to use the UNIX protected mode assembler. This allowed us to "cut the cord" 
with MS-DOS and permitted the system alone to support all code. We also chose to create drivers for the 
hardware directly, from the initial boot block on up, to break away from the BIOS as well. As a result, 
386BSD can now be easily retargeted to new buses that might not rely on either MS-DOS or the BIOS. 

Both the second and third bootstraps are actually separate incarnations of the same source code (drivers 
and all). The only difference is that the second bootstrap is a functional subset of the third bootstrap, so 
that it could fit within the small confines required. All of the bootstraps reference a special data structure 
called the disklabel that knows the layout and geometry of the disk drive booted. In this way thousands of 
different disk drives can be supported independent of MS-DOS and the BIOS information. 

Summary: Where is 386BSD Now? 

Perhaps the discussion of some of these issues might have seemed difficult or incomplete, but we found 
each item to be of tremendous importance in understanding the practice of a port to the 386 architecture. 
Unlike Berkeley UNIX ports to other systems, we found that we had to bend over backwards dealing with 
segments, memory issues, device issues, and a plethora of unique microprocessor features. Now, one may 
ask, was it all worth it? 

Well, BSD is now available on the 386 platform. Even though it is only a preliminary release, we already 
support the following: 

• Many different PC platforms, including the Compaq 386/20, Compaq Systempro 386, any 386 with 
the Chips and Technologies chipset, any 486 with the OPTI chipset, Toshiba 3100SX, and more. 

• ESDI, IDE, and ST-506 drives 

• 3-1/2 inch and 5-1/4 inch floppy drives 

• Novell NE2000 and Western Digital Ethernet controller boards 

• EGA, VGA, CGA, and MDA monitors 

• 287/387 floating point, including the Cyrix EMC 

• A single-floppy standalone UNIX system, containing support for modems, Ethernet, SLIP, and 
Kermit to facilitate downloading of 386BSD to any PC over the INTERNET network. 

Those of you who can meet University of California requirements should obtain a copy of 386BSD from 
the University of California, so that you can follow along yourself as we work through the basics of this 
port from every angle. 





In addition, we would like to thank some of the people who have helped make 386BSD a reality, including 
Mike Karels, Keith Bostic, and Kirk McKusick of CSRG, Dixon Dick and all the support engineers at 
Compaq, Fred Dunlap and Bob McGhee of Cyrix, Don Ahn (UCB), Tim L. Tucker (Evans and 
Sutherland), and Clem Cole (Cole Computer Consulting). 

Suggested Readings 

• Leffler, Samuel J., Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman. The Design 
and Implementation of the 4.3BSD UNIX Operating System. Reading, Mass.: Addison-Wesley, 1989. 

• Crawford, John H., and Patrick P. Gelsinger. Programming the 80386. Alameda, Calif.: Sybex, 1987. 

• IBM Technical Reference: Personal Computer AT. Boca Raton, Fla.: IBM, 1984. 

386 Segmentation and Paging 

The 386 has six segment registers (CS, DS, SS, ES, FS, and GS) which can select one of 16,383 (8,191 
shared and 8,192 private) segment descriptors. These segment descriptors reside in either the Global 
Descriptor Table (GDT) or the Local Descriptor Table (LDT) and determine underlying characteristics 
(type attributes, location in linear address space, and segment size). In addition to memory segments, 
system segments are available to the operating system for special purposes and call gates to facilitate 
controlled indirection into other possibly hidden segments. 

Memory segments are selected via dedicated segment registers, each with a different role. The CS register 
selects program instructions. The DS register selects program data. The SS register selects the program 
stack. The ES register selects the destination of string instructions. Both the FS and GS registers are 
undedicated at this time. It is even possible to reassign the segment registers in the machine instructions, 
so one can view the ES, FS, and GS segment registers as alternative DS segment registers. 

Each memory segment has a size, and can be as large as 4 gigabytes. In order for that segment to be 
active, however, it must consume space (global linear address space) in direct proportion to its size. This 
means that, although a process may possess a total address space greater than 4 gigabytes, only an 
aggregate of active segments totaling less than or equal to 4 gigabytes is permitted. While the 386 
theoretically can address 2^14 x 2^32 (that is, 2^46) bytes, in practice only 2^32 bytes (4 gigabytes) can be active at 
any time. If the maximum 4 gigabytes of instruction, data, and stack (for both operating system and each 
user process) is invoked, managing the global linear address space to allow segments to be active (present) 
when linear address space is available becomes a significant problem. 

Segments can also be overlapped in linear address space. Because through both segments we can access 
the same memory interchangeably, possibly with different attributes, this overlap is called an alias. 

80x86 segments can be either "bottom up" or "top down." A segment that is bottom up means that one 
begins with segment relative address 0 and "grows up" to the desired address x (that is, [0 ... x]). A 
segment that is top down means that one begins with segment relative address 0xffffffff and "grows down" 
to the desired address y ([y ... 0xffffffff]). (Yes, we know this is awkward, but that’s how it works). 
Segments are grown only in accordance to these rules. The stack segment is the only common example of 
a downward growing segment. 





Many other attributes are provided that control the type of access allowed within the segment. The 
designers of the 386 prefer segments be used in memory protection regulation, and have provided a 
plethora of features not found in the paging unit. Segment attributes, such as 32-bit vs. 16-bit operations, 
byte vs. page granularity, and user vs. supervisor mode, control the mode of the microprocessor, 
depending on the segments that are actually in use. 

It is quite costly to implement segments in the microprocessor. That is why underlying shadow registers, 
invisible to the programmer, are used. They provide a hardware "assist" to the segmentation functionality. 

We manage to avoid many paging bookkeeping problems by running in "flat" mode. This is accomplished 
by aliasing the CS, DS, SS, and ES segment registers to the exact same linear address space (see Figure 4), 
thus making it an identity function. We can then regard any of the intrasegment addresses as if they were 
linear address space. Of course, this ends up defeating the advantages of segments as well. 
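
To make the aliasing concrete, the following sketch packs a 386 segment descriptor from its base, limit, and attribute fields, then builds a code and a data descriptor that both start at linear address 0 and span 4 gigabytes. The field layout is the one Intel documents for the processor; the function and macro names are invented here, and the attribute values are generic choices for flat segments, so the descriptors 386BSD actually installs may well differ.

```c
#include <stdint.h>

/* Pack an 8-byte 386 segment descriptor.
   limit20 is the 20-bit limit field; flags holds the G, D, and AVL bits. */
static uint64_t make_descriptor(uint32_t base, uint32_t limit20,
                                uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit20 & 0xFFFFu);             /* limit bits 0..15  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;        /* base bits 0..23   */
    d |= (uint64_t)access << 40;                    /* type, DPL, present */
    d |= (uint64_t)((limit20 >> 16) & 0xFu) << 48;  /* limit bits 16..19 */
    d |= (uint64_t)(flags & 0xFu) << 52;            /* G, D, AVL         */
    d |= (uint64_t)(base >> 24) << 56;              /* base bits 24..31  */
    return d;
}

#define ACC_CODE   0x9Au   /* present, level 0, code, readable (assumed choice) */
#define ACC_DATA   0x92u   /* present, level 0, data, writable (assumed choice) */
#define FLAGS_FLAT 0xCu    /* page granularity, 32-bit operations */

/* Both descriptors cover linear addresses 0 through 0xffffffff. */
static uint64_t flat_code_descriptor(void)
{
    return make_descriptor(0, 0xFFFFFu, ACC_CODE, FLAGS_FLAT);
}

static uint64_t flat_data_descriptor(void)
{
    return make_descriptor(0, 0xFFFFFu, ACC_DATA, FLAGS_FLAT);
}
```

Since the two descriptors differ only in their access byte, an offset in either segment names the same linear address, which is what lets intrasegment addresses be treated as linear ones.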

Some new microprocessors, such as the 386, feature architectures which exploit large segments. This is 
because 4 gigabytes is starting to fill up, and going to 64-bit addresses will not be happening soon. Many 
would argue that 4 gigabytes will never be filled, but history states otherwise. 64-Mbit RAM is already on 
the drawing boards — in fact, some actually exist. In a few years, it will be commercially available. 
Because a typical computer uses on average 64 to 128 RAM chips, with many companies currently 
offering 64-Mbyte systems (512 1-Mbit RAM chips), it will not be long before a computer with 512 64-Mbit 
RAM chips (4 gigabytes) is introduced. As such, segmented architectures may provide a way of spanning 
the address space gap that could result. 

It’s amazing that at the beginning of the microcomputer revolution, an Altair 8800 with 4 Kbyte of RAM 
was considered incredible because it could run Basic! How times change. 

We have seen how segmentation works in the 386. Now let’s examine paging. For our purposes, 
segmentation on the 386 is defeated by running in "flat" mode. We can then consider intrasegment 
addresses as if they are linear address space. 

Paging works with a two-level scheme that permits the sparse allocation of address space, so that the 
whole address space, or even all of the address space mapping information, need not be present. 

Otherwise, a 4-gigabyte process would require more than 4 Mbyte of page tables, even though it may be 
the case that only a few thousand entries would be active at any time. Typically, for our purposes, only three 
pages of page tables are allocated per process (page directory and the top and bottom address space page 
tables). This is sufficient to run a 4-Mbyte process (instruction plus data size) and 4 Mbyte of stack. (Note 
that all processes run with a full-sized address space and can dynamically grow to use it.) This mechanism 
is quite successful in reducing memory-management overhead. 
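
The savings can be worked out directly. The sketch below assumes 4-Kbyte pages and 4-byte page table entries, as on the 386; the function names are hypothetical and exist only to show the arithmetic.

```c
#include <stdint.h>

#define PAGE_BYTES   4096u   /* size of one page */
#define ENTRY_BYTES  4u      /* size of one PDE or PTE on the 386 */

/* Address space covered by one page of page table entries. */
static uint64_t bytes_mapped_per_table(void)
{
    return (uint64_t)(PAGE_BYTES / ENTRY_BYTES) * PAGE_BYTES;
}

/* Page table memory needed if every page of `bytes` were mapped,
   including the single page directory. */
static uint64_t full_map_overhead(uint64_t bytes)
{
    uint64_t pages = bytes / PAGE_BYTES;
    return pages * ENTRY_BYTES + PAGE_BYTES;
}

/* Page table memory for a sparse layout that uses `table_pages` pages
   (counting the page directory as one of them). */
static uint64_t sparse_overhead(unsigned table_pages)
{
    return (uint64_t)table_pages * PAGE_BYTES;
}
```

A fully mapped 4-gigabyte space needs 4 Mbyte of tables plus a 4-Kbyte directory, whereas the three-page arrangement costs 12 Kbyte and still maps 4 Mbyte at each end of the address space.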

The two-level scheme splits the incoming virtual address into three parts: 10 bits of page table directory 
index, 10 bits of page table index, and 12 bits of offset within a page. The page table directory is a single 
page of physical memory that facilitates allocation of page table space by breaking it up into 4-Mbyte 
chunks of linear address space per each of its 1024 PDEs (Page Directory Entries), which determine the 
location of underlying page tables in physical memory. 

Each PDE-addressed page of a page table contains 1024 PTEs (Page Table Entry). A PTE is similar in 
form and function to a PDE. The major difference between a PDE and a PTE is that a PTE selects the 
physical page frame for the desired reference. The final address is formed by combining that frame with
the offset held in the least-significant address bits. This method is identical to that used in many other common
microprocessors (the MC68030, Clipper, and NS32532, among others). 
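
As a minimal sketch of the 10/10/12 split described above, the following C fragment decomposes a 32-bit linear address and rebuilds a physical address from a page frame number. The structure and function names are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct va_parts {
    uint32_t dir_index;    /* VA bits 31..22: selects 1 of 1024 PDEs */
    uint32_t table_index;  /* VA bits 21..12: selects 1 of 1024 PTEs */
    uint32_t offset;       /* VA bits 11..0: byte within the 4-Kbyte page */
};

/* Split a linear address into its three translation fields. */
struct va_parts split_va(uint32_t va)
{
    struct va_parts p;
    p.dir_index   = (va >> 22) & 0x3FFu;
    p.table_index = (va >> 12) & 0x3FFu;
    p.offset      = va & 0xFFFu;
    return p;
}

/* Combine the page frame number selected by a PTE with the page offset. */
uint32_t physical_address(uint32_t frame_number, uint32_t offset)
{
    assert(frame_number < (1u << 20));
    assert(offset < 4096u);
    return (frame_number << 12) | offset;
}
```

For example, linear address 0x00403025 falls in the second 4-Mbyte chunk (directory index 1), uses entry 3 of that chunk's page table, and lands at byte 0x025 of the selected frame.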

Each PDE and PTE may be marked either "invalid" (not currently used) or "valid" (the underlying page of 
physical memory is present). In addition, other attribute bits mark entries as "read only" or "read-write" 
and "supervisor" or "user." Because segmentation is not used to control memory protection, we keep 
processes honest by relying entirely on the paging mechanism’s attributes for protection as well as for the 
allocation of memory. 
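
The attribute bits can be illustrated with a small sketch. The bit positions (bit 0 valid, bit 1 read-write, bit 2 user) are those of the 386 entry format; the macro names and the checking function are assumptions made for this example, and the sketch examines a single entry, whereas the hardware applies the check at both the PDE and PTE levels.

```c
#include <assert.h>
#include <stdint.h>

#define PG_V   0x001u  /* bit 0: entry is valid (page present) */
#define PG_RW  0x002u  /* bit 1: read-write when set, read only when clear */
#define PG_U   0x004u  /* bit 2: user when set, supervisor only when clear */

/* Returns 1 if a user-mode access through this entry would be permitted,
 * 0 if it would raise a page fault. */
int user_access_ok(uint32_t entry, int is_write)
{
    assert(is_write == 0 || is_write == 1);
    if (!(entry & PG_V))
        return 0;               /* invalid: not currently mapped */
    if (!(entry & PG_U))
        return 0;               /* supervisor-only page */
    if (is_write && !(entry & PG_RW))
        return 0;               /* write to a read-only page */
    return 1;
}
```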

The mechanism to convert virtual to physical addresses is quite elaborate. To speed things up, the 386
keeps a Translation Look-aside Buffer (TLB) of 32 cached entries, managed entirely transparently. One
side effect of this hardware is that if the operating system changes any of the page tables that may be in
use, it must flush this cache. The 386 does not allow selective flushing; the only option is a complete flush
of all cache entries by reloading the page directory address register cr3. This is an expensive operation that
may be performed repeatedly as we successively transform an address mapping of a process within the
kernel (as many as six times in the worst case).

—B.J., L.J. 


[FIGURE 6] 

/* Intel 386 process control block */
struct pcb {
        struct  i386tss pcbtss;
#define pcb_ksp pcbtss.tss_esp0
#define pcb_ptd pcbtss.tss_cr3
#define pcb_pc  pcbtss.tss_eip
#define pcb_psl pcbtss.tss_eflags
#define pcb_usp pcbtss.tss_esp
#define pcb_fp  pcbtss.tss_ebp

/* Software pcb (extension) */
        int     pcb_fpsav;
#define FP_NEEDSAVE     0x1     /* need save on next context switch */
#define FP_NEEDRESTORE  0x2     /* need restore on next DNA fault */
        struct  save87 pcb_savefpu;
        struct  pte *pcb_p0br;
        struct  pte *pcb_p1br;
        int     pcb_p0lr;
        int     pcb_p1lr;
        int     pcb_szpt;       /* number of pages of user page table */
        int     pcb_cmap2;
        int     *pcb_sswap;
        long    pcb_sigc[8];    /* sigcode actually 19 bytes */
        int     pcb_iml;        /* interrupt mask level */
};

/* Intel 386 Task Switch State */
struct i386tss {
        long    tss_link;       /* actually 16 bits: top 16 bits must be zero */
        long    tss_esp0;       /* kernel stack pointer privilege level 0 */
#define tss_ksp tss_esp0
        long    tss_ss0;        /* actually 16 bits: top 16 bits must be zero */
        long    tss_esp1;       /* kernel stack pointer privilege level 1 */
        long    tss_ss1;        /* actually 16 bits: top 16 bits must be zero */
        long    tss_esp2;       /* kernel stack pointer privilege level 2 */
        long    tss_ss2;        /* actually 16 bits: top 16 bits must be zero */
        long    tss_cr3;        /* page table directory physical address */
#define tss_ptd tss_cr3
        long    tss_eip;        /* program counter */
#define tss_pc  tss_eip
        long    tss_eflags;     /* program status longword */
#define tss_psl tss_eflags
        long    tss_eax;
        long    tss_ecx;
        long    tss_edx;
        long    tss_ebx;
        long    tss_esp;        /* user stack pointer */
#define tss_usp tss_esp
        long    tss_ebp;        /* user frame pointer */
#define tss_fp  tss_ebp
        long    tss_esi;
        long    tss_edi;
        long    tss_es;         /* actually 16 bits: top 16 bits must be zero */
        long    tss_cs;         /* actually 16 bits: top 16 bits must be zero */
        long    tss_ss;         /* actually 16 bits: top 16 bits must be zero */
        long    tss_ds;         /* actually 16 bits: top 16 bits must be zero */
        long    tss_fs;         /* actually 16 bits: top 16 bits must be zero */
        long    tss_gs;         /* actually 16 bits: top 16 bits must be zero */
        long    tss_ldt;        /* actually 16 bits: top 16 bits must be zero */
        long    tss_ioopt;      /* options & io offset bitmap: currently zero */
                                /* XXX unimplemented .. i/o permission bitmap */
};


[FIGURE 8] 


#include

        .globl  _write, _errno

# amtwritten = write(fildes, address, count);

_write:                         # caller places arguments on stack
        lea     SYS_write, %eax # select desired system call
        lcall   $0x7,0          # call the system
        jb      1f              # if system returns error, handle
        ret                     # otherwise return
1:      movl    %eax,_errno     # save error in global variable
        movl    $-1,%eax        # indicate error has occurred
        ret                     # and return


Figure 10: ISA device controllers: (a) data structures for configuring devices (b) 
sample table of possible devices 

/* Per device structure. */
struct isa_device {
        struct  isa_driver *id_driver;  /* per driver configuration info */
        short   id_iobase;      /* Base i/o address register */
        short   id_irq;         /* Interrupt request */
        short   id_drq;         /* DMA request */
        caddr_t id_maddr;       /* Physical shared memory address */
        int     id_msize;       /* Size of shared memory */
        int     (*id_intr)();   /* Interrupt interface routine */
        int     id_unit;        /* Physical unit number within driver */
        int     id_scsiid;      /* SCSI id if SCSI device */
        int     id_alive;       /* Device is present and accounted for */
};

/* Per driver structure. */
struct isa_driver {
        int     (*probe)();     /* Test whether device is present */
        int     (*attach)();    /* Setup driver for a device */
        char    *name;          /* Device name */
};

[FIGURE 10b]

/* ISA Bus devices */

#include "machine/isa/device.h"         /* device structure */

/* Software drivers */

#define V(s)    V/**/s

extern struct driver wddriver; extern V(wd0)();
extern struct driver cndriver; extern V(cn0)();
extern struct driver comdriver; extern V(com0)(); extern V(com1)();
extern struct driver fddriver; extern V(fd0)();
extern struct driver nedriver; extern V(ne0)();

/* Possible hardware devices */

#define C       (caddr_t)

struct isa_device isa_devtab_bio[] = {
/* driver       iobase   irq    drq  maddr       msiz     intr     unit */
{ &wddriver,    IO_WD0,  IRQ14, -1,  C 0,        0,       V(wd0),  0},
{ &wddriver,    IO_WD1,  IRQ13, -1,  C 0,        0,       V(wd1),  1},
{ &fddriver,    IO_FD0,  IRQ6,  2,   C 0,        0,       V(fd0),  0},
{ &fddriver,    IO_FD1,  IRQ6,  2,   C 0,        0,       V(fd1),  1},
0
};

struct isa_device isa_devtab_tty[] = {
/* driver       iobase   irq    drq  maddr       msiz     intr     unit */
{ &vgadriver,   IO_VGA,  0,     -1,  C 0xa0000,  0x10000, 0,       0},
{ &cgadriver,   IO_CGA,  0,     -1,  C 0xa0000,  0x4000,  0,       0},
{ &mdadriver,   IO_MDA,  0,     -1,  C 0xb8000,  0x4000,  0,       0},
{ &kbddriver,   IO_KBD,  IRQ1,  -1,  C 0,        0,       V(kbd0), 0},
{ &cndriver,    IO_KBD,  IRQ1,  -1,  C 0,        0,       V(cn0),  0},
{ &comdriver,   IO_COM0, IRQ4,  -1,  C 0,        0,       V(com0), 0},
{ &comdriver,   IO_COM1, IRQ3,  -1,  C 0,        0,       V(com1), 1},
0
};

struct isa_device isa_devtab_net[] = {
/* driver       iobase   irq    drq  maddr       msiz     intr     unit */
{ &nedriver,    0x320,   IRQ9,  -1,  C 0,        0,       V(ne0),  0},
0
};

struct isa_device isa_devtab_null[] = {
/* driver       iobase   irq    drq  maddr       msiz     intr     unit */
0
};



Figure 5: 386BSD process virtual address space


Figure 4: 386BSD segment registers (diagram regions: Kernel Global Address Space; User Process Private Address Space)



Figure 9: ISA physical memory map (physical memory address axis marked at 0x000000, 0x0A0000, 0x100000, and Max. Memory)

Figure 3: 386 memory management: (a) segmentation, (b) paging

(a) Process memory reference (segment register and offset, DS:SA) to linear address space: DS(15:3) selects a descriptor in the LDT or GDT, giving the linear address descriptor base + SA.

(b) Linear address space to physical memory: cr3 locates the page table directory, VA(31..22) selects a PDE, VA(21..12) selects a PTE in the page table, and VA(11..0) is the offset within the physical page of data.


Figure 1: The UNIX family tree

Systems shown: PWB (UTS 1.0), System 3, System 5.1, System 5.2, System 5.3, System 5.4?, 2.8BSD, 2.9BSD, 2.10BSD, 3BSD, 4.0BSD, 4.1BSD, 4.2BSD, 4.3BSD, 4.4BSD?, MACH, OSF/1?

Organizations labeled: AT&T Bell Labs, UNIX Intl (AT&T), CMU, OSF, Univ. Calif.




Figure 2: 386BSD and other BSD releases


Porting Unix To The 386: Three Initial Pc Utilities 

Utilities to let you execute GCC-compiled programs in protected mode from MS-DOS and copy files to a
shared portion of disk so MS-DOS and UNIX can exchange information.

Getting to the hardware 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9 BSD and the chief architect of National Semiconductor’s 
GENIX. Lynne established TeleMuse, a market research firm specializing in the telecommunications and 
electronics industry. They can be contacted via e-mail at william@berkeley.edu or at uunet!william.
Copyright (c) 1990 TeleMuse. 

In last month’s installment, we discussed the elements of the 386BSD port, which required planning prior 
to the actual coding. In brief, the specification we outlined emphasized BSD compatibility, efficient use of 
the 80386 architecture, interoperability with extant commercial standards, and rapid implementation to 
leverage BSD UNIX to port the rest of itself. We also discussed the conflicts inherent between a 
segmented architecture and a virtual memory system which prefers paging, other microprocessor 
idiosyncrasies and requirements, and the basic planning for the surrounding hardware. By taking a
"practical approach" to this port and focusing on "hard adherence" to BSD operability and
high performance, we identified the key milestones required for this (or any) advanced operating system
port and set the stage for our next effort: writing the PC utilities that allow us to initially load the first
programs and data onto our 386 target host.

With this in mind, we’ll now examine code from three PC-based utilities (boot.exe, cpfs.exe, and
cpsw.exe) that provide the basic access to the hardware from MS-DOS needed to begin a UNIX port.
boot.exe executes a GCC-compiled program (using the Free Software Foundation’s GNU C Compiler) in
protected mode from MS-DOS. (Note that GCC generates only 32-bit protected-mode code.) cpfs.exe
installs a root filesystem onto the hard disk. cpsw.exe copies files to a shared portion of disk so that
MS-DOS and UNIX can exchange information.

In examining these areas, we will illustrate how the UNIX bootstrap process functions, because these 
programs mimic that process to a great degree. This will be important in later articles when we discuss the 
code and strategies used to build the bootstraps that allow the newly ported system to become independent 
of MS-DOS. 

The Purpose of Our PC Utilities 

To port UNIX, we needed to devise methods to: Load large 32-bit protected-mode programs (that is, the 
BSD kernel); load the initial root filesystems; and communicate information onto our early UNIX system 
to augment its capabilities as we port increasing numbers of utilities. 

An initial UNIX port to a brand-new architecture with no native software can be a miserable task for the 
inexperienced. One of the authors has done this for other architectures and survived, but we don’t 
recommend it because it forces you to write absolute code for the purposes outlined above, only to 
abandon it for the UNIX code, which eventually provides the same function. Writing absolute code is 
difficult to debug (because there is no debugging environment), time consuming (one needs to support and 
initialize the entire machine in addition to the above functions), and subtle (tiny machine-dependent 
characteristics thwart the development effort). These concerns, especially when working with a processor 
as complex as the 386, arise when the port is most vulnerable — when there is little project history. 

One of the advantages of porting an operating system to a popular machine like the PC lies in the wealth 
of previously written program development software. (In other words, someone else has already suffered
for our art.) We chose to use Borland’s Turbo C and Microsoft’s MASM, primarily because they "were
there," and also because they were appropriate to rapid PC program prototyping. While these programs do
rely on a few object library primitives in Turbo C, they are reasonably portable to other MS-DOS C
implementations, and on the whole are not restricted solely to Turbo C (or MASM, for that matter).

Another advantage of the 386 PC environment is that MS-DOS and its applications programs run on the
absolute machine, and do not rely on memory management, relocation, or protection. Thus, we could write
programs that would ultimately usurp control from the MS-DOS operating system without regard for its
functions and strategies. An operating system that makes extensive use of memory management
mechanisms, such as System V UNIX, would have made it more difficult to write and debug an absolute 
program loader. In this case, we would have spent more time defeating those mechanisms than we would 
have spent writing absolute programs in the first place! 





The First PC Utility: boot.exe 


boot.exe is quite simple in theory, as our mock code fragment in Example 1 demonstrates. It just loads a
GCC executable into memory at location 0, enters into protected mode, and then executes it. Simple, huh? 
There are some niggling little gotchas, however: 


Example 1: Mock code that loads a GCC executable into memory 

main() {
        int fi;
        struct exec hdr;

        fi = open("pgm", O_RDONLY);
        read(fi, &hdr, sizeof(hdr));
        read(fi, (char *) 0, hdr.a_text + hdr.a_data);
        (*(void (*)()) 0)();
        /* NOTREACHED */
}

• Programs are frequently larger than is considered "convenient" in the PC world. On the PC, 64K or 
less is considered adequate, while the UNIX kernel we must load averages about 280 Kbytes in size, 
so we will have to manage the so-called "far" pointers in a large model 8086 program. 




• The bottom (address 0) of PC memory contains a critical portion of the MS-DOS operating system. 
We will need to use MS-DOS itself to load the program, so we can’t touch this area until after we 
read in the entire program. We will therefore have to allocate a pool of memory space large enough to 
temporarily hold a copy of the program we are loading until it is safe to overwrite location 0. 

• Once we enter into protected mode, we can’t easily go back and enter MS-DOS again, so we must do 
all our checks and anticipate needs prior to taking that last giant step. 


Listing One is the boot.c program, which resolves these three areas. Note that it is no longer a simple
eight-line program. Ah well, life is never simple.
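
The first of these gotchas, stepping a pointer through a buffer larger than one 64K segment, can be modeled in portable C. This is only a sketch of the technique: the structure and the function names are hypothetical, and it is not the add_to_far() routine that Listing One declares.

```c
#include <assert.h>
#include <stdint.h>

struct far_ptr {
    uint16_t seg;   /* 16-byte paragraph number */
    uint16_t off;   /* byte offset within the segment */
};

/* Linear address named by a real-mode segment:offset pair. */
uint32_t far_to_linear(struct far_ptr p)
{
    return ((uint32_t)p.seg << 4) + p.off;
}

/* Normalized pair for a linear address: the offset is kept in 0..15. */
struct far_ptr linear_to_far(uint32_t linear)
{
    struct far_ptr p;
    assert(linear < 0x100000u);     /* must lie in the 1-Mbyte real-mode space */
    p.seg = (uint16_t)(linear >> 4);
    p.off = (uint16_t)(linear & 0xFu);
    return p;
}

/* Advance a far pointer by n bytes without wrapping inside a 64K segment. */
struct far_ptr far_add(struct far_ptr p, uint32_t n)
{
    return linear_to_far(far_to_linear(p) + n);
}
```

Adding 0x20 to 2000:FFF0 by touching only the 16-bit offset would wrap to 2000:0010, which is the kind of fence-post error the loader has to avoid; going through the linear address yields 3001:0000 instead.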


The GCC Executable Format 


The programs to be loaded have been generated on another UNIX system, where the GCC compiler, GAS 
assembler, and BSD linkage editor provide cross-development support, allowing us to generate BSD a.out 
format files. This format is the oldest of the many (and, unfortunately, still growing) different UNIX
executable file formats. The BSD a.out format consists of a header structure (see Listing Two, exec.h) that
details the size of sections following, the instruction segment (or text), the data segment, relocation 
information, and finally, a symbol-table segment. At this time, we are interested only in the information 
contained in instruction and data sections, which we then load into a large, dynamically allocated 
temporary array, before moving it into position. We do not use the relocation information or the 
symbol-table segment. 
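
A minimal sketch of such a header follows. The fields a_magic, a_text, a_data, a_bss, and a_entry are the ones Listing One refers to; the remaining fields, the fixed-width types, and the helper functions are assumptions based on the traditional BSD a.out layout, and the exec.h of Listing Two may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

#define OMAGIC 0407     /* old impure format */
#define NMAGIC 0410     /* read-only (pure) text */
#define ZMAGIC 0413     /* demand-paged format */

struct aout_exec {
    uint32_t a_magic;   /* format identifier */
    uint32_t a_text;    /* size of text (instruction) segment, bytes */
    uint32_t a_data;    /* size of initialized data, bytes */
    uint32_t a_bss;     /* size of uninitialized data, bytes */
    uint32_t a_syms;    /* size of symbol table, bytes (assumed field) */
    uint32_t a_entry;   /* entry point */
    uint32_t a_trsize;  /* size of text relocation, bytes (assumed field) */
    uint32_t a_drsize;  /* size of data relocation, bytes (assumed field) */
};

/* Returns 1 if the header carries one of the three recognized magic numbers. */
int aout_magic_ok(const struct aout_exec *h)
{
    assert(h != 0);
    return h->a_magic == OMAGIC || h->a_magic == NMAGIC || h->a_magic == ZMAGIC;
}

/* Bytes the loader is interested in: text followed by data. */
uint32_t aout_load_bytes(const struct aout_exec *h)
{
    assert(aout_magic_ok(h));
    return h->a_text + h->a_data;
}
```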







Consistency Checks 

Loading this large array of data containing the programs to be executed is a complex task, because many 
different 64K segments may be used. A "fence-post" error arising from incorrectly maintained far pointers 
can lead to unpredictable results when the protected mode program runs. Therefore, to verify that the 
program contents are loaded correctly, we use a simple checksum just before we dispatch to it in protected 
mode (see Listing One, boot.c).
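
Listing One calls a routine named far_cksum() for this purpose. Its body is not shown in this section, so the byte sum below is an assumed algorithm, offered only as a sketch of what a "simple checksum" over the loaded image might look like; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum every byte of the loaded image, modulo 2^32. */
uint32_t image_cksum(const unsigned char *image, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(image != NULL || len == 0);
    for (i = 0; i < len; i++)
        sum += image[i];
    return sum;
}
```

Running the same routine over the file on the cross host and over the buffer on the PC, then comparing the two values by eye, is enough to catch a misplaced or missing 64K chunk before control is handed to the protected-mode program.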

GATE A20 


Another feature that deserves mention is the PC hardware feature known as GATE A20. Because
the original IBM PC had only 20 bits of address (2^20 bytes, or 1 Mbyte, on address lines A19 through A0), newer
machines possessing greater physical address space (the 80286 with 16 Mbytes and the 80386 with 4
gigabytes) might exhibit a small difference when executing in real mode. GATE A20 was designed to
mitigate this problem. Without it, a reference incrementing up past the topmost address would actually
land outside of the 20-bit address space, because the rollover would be carried up instead of being
wrapped around to address zero. GATE A20 would not be necessary were it not for the presence of a
considerable body of ancient MS-DOS applications that rely on this wraparound, assuming that a reference
past the top of the address space rolls over to the bottom of physical memory. Thus the
urgent need for GATE A20 (short for "Gate the A20 address line to logic zero"). With our UNIX system,
we will want to grab all available RAM, especially that above 1 Mbyte, so we need to defeat the
GATE A20 feature and allow all the processor’s address lines to be functional. We did this with our
gatea20.asm module in Listing Three, invoked by boot.c in Listing One.
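
The wraparound can be illustrated with a few lines of C. This is a model of the address arithmetic only, not the code in gatea20.asm; the function name and the a20_enabled flag are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define A20_MASK 0x0FFFFFu      /* keeps address lines A19..A0 only */

/* Physical address produced by a real-mode segment:offset reference.
 * a20_enabled == 0 models GATE A20 forcing line A20 to zero (8086-style
 * wraparound); a20_enabled == 1 models the gate defeated, so that the
 * carry into A20 reaches memory above 1 Mbyte. */
uint32_t real_mode_address(uint16_t seg, uint16_t off, int a20_enabled)
{
    uint32_t linear = ((uint32_t)seg << 4) + off;   /* at most 0x10FFEF */

    assert(a20_enabled == 0 || a20_enabled == 1);
    return a20_enabled ? linear : (linear & A20_MASK);
}
```

With the gate active, FFFF:0010 wraps to address 0, as old MS-DOS programs expect. With it defeated, the same reference reaches 0x100000, the first byte above 1 Mbyte.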


Entering Protected Mode 


Protected-mode programming frequently has a mystique about it, probably due, in large part, to the 
difficulty in going between modes on the 80(2/3/4)86 processors on which it is supported. You can’t just 
poke a bit, or execute a single instruction, and end up executing in protected mode. The transition is a 
methodical one, where, over the course of tens of instructions, the processor is incrementally prepared for 
the transition (which, by the way, is not intuitively obvious). This, of course, gives errors many 
opportunities to sneak in. Writing and debugging a subroutine for reliable entry into protected mode was 
not exactly the evening’s diversion we estimated; embarrassingly enough, it took nearly a month. 


As you examine the code from protentr.asm in Listing Four, you can see that many different things are
being reconciled at once. There are three different kinds of addressing standards being interconverted as 
needed: 


• 20-bit segment:offset pair "real" mode addresses 

• 32-bit absolute or physical addresses 

• 32-bit segment selector:offset protected mode addresses

Protected-mode instructions are being "generated" from within a "real" mode assembler. A descriptor table 
is encoded in its peculiar and convoluted structure style, which has its base address split into high and low 
address chunks on separate portions of the descriptor. Note that in some versions of MASM, LIDT/LGDT
instructions present undocumented surprises.
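
The split base address is easier to see in code than in prose. The following sketch packs a 386 segment descriptor into its 8-byte form; it is not taken from protentr.asm, and the function name and argument order are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 386 segment descriptor. `access` is the access-rights byte and
 * `flags` is the 4-bit field holding the granularity and default-size bits. */
void encode_descriptor(uint32_t base, uint32_t limit,
                       uint8_t access, uint8_t flags, uint8_t out[8])
{
    assert(limit <= 0xFFFFFu);  /* the limit is a 20-bit field */
    assert(flags <= 0xFu);

    out[0] = (uint8_t)(limit & 0xFFu);          /* limit bits 7..0   */
    out[1] = (uint8_t)((limit >> 8) & 0xFFu);   /* limit bits 15..8  */
    out[2] = (uint8_t)(base & 0xFFu);           /* base bits 7..0    */
    out[3] = (uint8_t)((base >> 8) & 0xFFu);    /* base bits 15..8   */
    out[4] = (uint8_t)((base >> 16) & 0xFFu);   /* base bits 23..16  */
    out[5] = access;
    out[6] = (uint8_t)((flags << 4) | ((limit >> 16) & 0x0Fu));
    out[7] = (uint8_t)((base >> 24) & 0xFFu);   /* base bits 31..24  */
}
```

A flat 4-gigabyte code segment (base 0, limit 0xFFFFF in 4-Kbyte units, access byte 0x9A, flags 0xC) encodes as FF FF 00 00 00 9A CF 00. Note how the top byte of the base sits apart from the other three.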

Our goal with this subroutine is to turn the 386 into a "flat" 32-bit address space, reminiscent of a 68000, 
and to dispatch to location 0 to execute the above loaded program. Because we don’t anticipate using any 
other descriptors while our stand-alone program runs, the descriptor table itself is abandoned in memory — 
probably to be written over during protected-mode program execution. 

Interrupts are disabled before entry into protected mode. We don’t yet know where the interrupt and 
exception processing code exists in the protected-mode program, so we must leave the IDT uninitialized 
(zero length). This means that if an exception or interrupt occurs, the processor will spontaneously reset. 
Thus, the first responsibility of a just-loaded 32-bit program must be to sensibly initialize itself to catch 
these conditions. 

Note that the code for entry into protected mode is PIC (Position Independent Code). We can easily
overwrite the memory of the bootstrap program itself, so we must arrange to copy this protected-mode
entry code just above our protected-mode program. This ensures its survival when we overwrite
MS-DOS, and quite possibly our boot program, never to return.

The Second PC Utility: cpfs.exe 

In addition to being able to run 32-bit protected-mode programs, we need to load a preliminary root 
filesystem for our BSD UNIX kernel to access as it initializes itself during the late phase of boot-up. Like 
MS-DOS, UNIX needs a dedicated region of the hard disk drive to store the data structures and data 
blocks that support its filesystem scheme. As we do not have a drive dedicated to BSD, we must instead 
secure a second partition on the sole disk drive to contain the BSD root filesystem. 

cpfs.c (see Listing Five) is a program that loads this filesystem from a previously downloaded MS-DOS
file. cpfs.c leverages BIOS disk calls to write appropriately to the absolute disk. If this program were to be
commonly used, you might wish to dig out the disk geometry and BSD partition from additional MS-DOS
and/or BIOS calls, but for our purposes, this program is sufficient.

The first block of the root (typically, the first sector of the drive) contains the disk label (see Listing Six,
diskl.h). This data structure will eventually be used in the 386BSD port to make the system
drive-independent. However, we first need to place the seminal label on the very first filesystem. cpfs.c,
which has hard-wired geometry constants, can initially be compiled (by defining "FIRST") to blindly write
that first filesystem with this label. In subsequent use (compiled without "FIRST" defined), cpfs.c will use
this label to validate a load, hopefully saving a bleary-eyed developer from ultimate disaster.

When using this program with disk drives of more than 1024 cylinders, logical translation by the disk
controller to a different geometry is a problem. Some calls used by MS-DOS and applications programs
use a 10-bit field for the cylinder address in the disk address data structure, reflecting an early
limitation of some PC disk controllers (the WD1010 disk controller chip, for example). One clever
workaround that doesn’t require altering the operating system is to encode disk addresses with a logical
mapping scheme so that some of the cylinder address bits are mapped into the more plentiful sector and
head address bits. This scheme, while quite acceptable to MS-DOS (which is never picky about sector
placement), is not acceptable to BSD (which is extremely picky about sector placement). The BSD Fast
Filesystem uses rotational and head placement algorithms to improve filesystem performance by taking
disk latency into account. Therefore, running on a logically mapped disk may significantly degrade
performance by throwing off this mechanism in the Fast Filesystem. Additional code is required to detect
and defeat this condition, because this translation must be maintained while MS-DOS is running.
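
A small sketch shows why the translated geometry misleads a placement-sensitive filesystem. The geometry figures in the usage note (1224 cylinders of 15 heads and 17 sectors, presented to MS-DOS as 612 cylinders of 30 heads) are invented for illustration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct chs {
    uint32_t cyl;   /* cylinder, 0-based */
    uint32_t head;  /* head, 0-based */
    uint32_t sect;  /* sector, 1-based as in PC disk addressing */
};

/* Convert an absolute sector number to cylinder/head/sector under a
 * given geometry. */
struct chs lba_to_chs(uint32_t lba, uint32_t heads, uint32_t sectors_per_track)
{
    struct chs c;

    assert(heads > 0 && sectors_per_track > 0);
    c.cyl  = lba / (heads * sectors_per_track);
    c.head = (lba / sectors_per_track) % heads;
    c.sect = (lba % sectors_per_track) + 1;
    return c;
}

/* Returns 1 if the cylinder number fits the 10-bit field discussed above. */
int fits_10bit_cylinder(struct chs c)
{
    return c.cyl < 1024u;
}
```

Under the assumed geometry, absolute sector 280755 is really cylinder 1101, head 0, which does not fit in 10 bits. The translated geometry reports it as cylinder 550, head 15, so software that believes this sector shares a cylinder with head 0 of cylinder 550 is wrong, and a seek is hidden where only a head switch was expected.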


The Third PC Utility: cpsw.exe 

Our last of three programs is of use when the early BSD operating system kernel is running and must
receive additional files. Ordinarily, we would prefer to use either serial communications or floppy disks to
add files to our nascent BSD root filesystem. However, our early kernel has drivers for only the display,
keyboard, and hard disk drive (that is, the "bare minimum"), because we want to use the system itself to
develop and test further extensions and improvements. In a nutshell, we want to leverage our tiny BSD
UNIX system with MS-DOS’s drivers and applications programs by using MS-DOS to receive
information into an MS-DOS file, and then using a trivial program to place this information on a reserved
portion of the disk, where BSD can easily access it.


At this point, we had seriously considered giving BSD the ability to read MS-DOS file structures directly, 
but this is a nontrivial process and we wished to devote our energies toward developing and improving the 
BSD kernel to become self-supporting. As a consequence, we decided to push this project off and favor 
instead an expedient solution to a temporary problem. 


If more disk space had been available, a partition could have been dedicated to the MS-DOS 
communications functions. Unfortunately, our early host machine contained only a 40-Mbyte drive, so we 
were very tight on space. (Yes, I know we have large machines now, but when you begin a project, it is 
usually on the cheap until you convince people that it is worthwhile — of course, by that time you’ve 
probably finished the project, or at least a fair portion of it). We elected to force the swap space to do 
double duty, by arranging to use it to hold information from and to MS-DOS just after or before BSD 
system operation. We were counting on the fact that we only use the swap space when the system really 
gets rolling. While this arrangement is somewhat heretical, it worked adequately enough to let us finish 
our nascent system to the point where it no longer required MS-DOS to boot or exchange files with other 
systems. 


cpsw.c (see Listing Seven) differs from cpfs.c in that it uses the disk label to configure itself. Disk
geometry is determined entirely from the disk label. Prior to using cpsw.c, a TAR file image is created on
a cross-host. This file is then transferred to an MS-DOS file via one of the many MS-DOS
communications programs available. cpsw.exe is then used to make this file accessible to 386BSD. 386BSD is
then booted, and the 386BSD TAR utility is invoked to extract the information (prior to paging). This
method is somewhat tedious, but proved adequate for the early stages of this port.


cpsw.exe is very similar in function to cpfs.exe, and both could be subsumed into a single program.
Simplicity, however, has allowed us to achieve our goal of getting 386BSD off the ground and running,
without becoming an outright diversion into an MS-DOS/UNIX merger, a weighty and significant
objective not suited for an early operating system project of considerable and ever-increasing scope, but
still short on history.





Where We Go From Here 


Now that we have our PC utilities in place, we can plan for the next stage in our 386BSD effort: 
development of the stand-alone system /sys/stand and its utilities. This system will possess the 
rudimentary drivers and a library of support routines which allow GCC programs to access devices and 
UNIX file-structures on the hard disk. It will also provide us with a platform to examine the requirements 
which must be met so that the 386 will support features to be incorporated into 386BSD. 

The 386BSD Project and Berkeley UNIX 

The 386BSD project was established in the summer of 1989 for the specific purpose of porting the 
University of California’s Berkeley Software Distribution (BSD) to the Intel 80386 microprocessor 
platform. Encompassing over 150 Mbytes of operating systems, networking, and applications software, 
BSD is a fully functional and nonproprietary complete operating systems software distribution. The goal 
of this project was to make this cutting-edge research version of UNIX widely available to small research 
and commercial efforts on an inexpensive PC platform. By providing the base 386BSD port to Berkeley, 
our hope is to foster new interest in Berkeley UNIX technology and to speed its acceptance and use 
worldwide. We hope to see those interested in this technology build upon it in both commercial and 
noncommercial ventures. 

In each of these articles we will examine the key aspects of software, strategy, and experience that make 
up a project of this magnitude. We intend to explore the process of the 386BSD port, while learning to 
effectively exploit features of the 386 architecture for use with an advanced operating system. We also 
intend to outline some of the trade-offs in implementation goals, which must be periodically reexamined. 
Finally, we will highlight extensions which remain for future work. 

Currently, 386BSD is available on the 386 PC platform and supports the following: 

• Many different PC platforms, including the Compaq 386/20, Compaq Systempro 386, any 386 with 
the Chips and Technologies chipset, any 486 with the OPTI chipset, Toshiba 3100SX, and more 

• ESDI, IDE, and ST-506 drives 

• 3.5 inch and 5.25 inch floppy drives 

• Cartridge tape drive 

• Novell NE2000 and Western Digital Ethernet controller boards 

• EGA, VGA, CGA, and MDA monitors 

• 287/387 floating point, including the Cyrix EMC 

• A single-floppy, stand-alone UNIX system, supporting modems, Ethernet, SLIP, and Kermit to
facilitate downloading of 386BSD to any PC over the Internet.

Copies of 386BSD source code can be obtained by contacting the Computer Systems Research Group 
(CSRG) at UC Berkeley. Some restrictions may apply. 

While working with us through our 386BSD article series, the following texts on Berkeley UNIX and the 
80386 microprocessor are also recommended: 





* The Design and Implementation of the 4.3BSD UNIX Operating System, by Samuel J. Leffler, 
Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman (Addison-Wesley, 1989). 

* Programming the 80386 by John H. Crawford and Patrick P. Gelsinger (Sybex, 1987). 

* IBM Technical Reference: Personal Computer AT (IBM, 1984). 

In addition, an augmented and detailed book on 386BSD by William Frederick Jolitz and Lynne Greer 
Jolitz, The 386BSD Handbook, will be available in the summer of 1991. 

- B.J., L.J. 

[LISTING ONE] 

/* Copyright (c) 1989, 1990 William Jolitz. All rights reserved.
 * Written by William Jolitz 7/89
 * Redistribution and use in source and binary forms are freely permitted
 * provided that the above copyright notice and attribution and date of work
 * and this paragraph are duplicated in all such forms.
 * THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
 * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
 * This program allows the bootstrap load of GCC cross compiled
 * 32 bit protected mode absolute programs onto the obtuse architecture
 * of PC AT/386, destroying the running DOS in the process.
 * Currently works with TURBO C 1.5 & MASM 5.0, relies on farmalloc().
 */

#pragma inline

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <io.h>
#include <alloc.h>
#include "exec.h"

#define PGSIZE  4096
#define CLOFSET (PGSIZE - 1)    /* 386 page roundup */

/* Header record of BSD UNIX executable file */ 
struct exec exec; 

long far_read(), to_long(); 
char far *to_far(); 

char far *add_to_far(char far *p, long n) ; 

/* Get a file we can open, attempt to load it */ 
main(argc, argv) 
char *argv[]; 

{ int fd, i; 

long addr, totalsz; 
char far *base; 

if (argc != 2) { 

printf("usage: boot \n"); 
exit (1) ; 

} 

fd = open(argv[1], O_BINARY); 
if (fd < 0) { 

printf("boot: Cannot open file \"%s\" \n", argv[1]); 
exit (1) ; 

} 





/* Reasonable file to load? */ 
i = read(fd, (char *)&exec, sizeof exec) ; 
if (i != sizeof exec || 
(exec.a_magic != OMAGIC && exec.a_magic != NMAGIC 
&& exec.a_magic != ZMAGIC)) { 

printf("Not a recognized object file format\n"); 
exit (1); 

} 

/* Allocate buffer for temporary copy of protected mode executable 

Buffer space requirements: |<-a.out->| pageroundup heap */ 

totalsz = exec.a_text + exec.a_data + exec.a_bss + 4096L + 20*1024L; 

/* Pad with trailing portion to put protected mode entry code in */ 
base = farmalloc(totalsz + 64*1024L); 
if (base == 0) { 

printf("Cannot allocate enough memory\n"); 
exit (1); 

} 

/* Load Instruction (e.g. text) portion of file */ 
printf ("Text %ld", exec.a_text); 

if (far_read(fd, base, exec.a_text) != exec.a_text) 
goto eof; 

/* Load Data portion of file */ 
addr = exec.a_text; 

/* Adjust for page alignment for pure procedure format */ 
if (exec.a_magic == NMAGIC && (addr & (PGSIZE-1))) 
while (addr & CLOFSET) 

* add_to_far(base, addr++) = 0; 
printf("\nData %ld", exec.a_data); 

if (far_read(fd, add_to_far(base,addr), exec.a_data) != exec.a_data) 
goto eof; 

/* Clear Uninitialized data (BSS) space */ 
addr += exec.a_data; 
printf("\nBss %ld", exec.a_bss); 
for ( ; addr < totalsz; ) 

* add_to_far(base,addr++) = 0; 
if(exec.a_entry) 

printf("\nStart 0x%lx", exec.a_entry); 

#ifdef CKSUM 

/* Optionally calculate checksum to validate against cross host's copy. */ 
far_cksum(base, addr-1L) ; 

#endif CKSUM 

printf("\n"); 

/* Effect copydown to absolute 0 and entry into protected mode at 
location "a_entry". */ 

transfer(base, totalsz, exec.a_entry); 

/* NOTREACHED */ 

eof: 

printf (" - File incomplete, load aborted\n"); 
exit (1); 

} 

/* We use the routines below to always keep far pointers normalized 
* to simplify comparison/subtraction. */ 
char far *to_far(l) long l; { 

unsigned seg, offs; 
seg = l>>4; 
offs = l & 0xf; 
return(MK_FP(seg,offs)); 

} 

long to_long(f) char far *f; { 

unsigned long l; 

l = FP_SEG(f)*16L + (unsigned long)FP_OFF(f); 
return(l); 

} 

char far *add_to_far(f,l) char far *f; long l; { 

return(to_far(to_long(f)+l)); 

} 

char far *normalize(f) char far *f; { 

unsigned seg,offs ; 

/* add in offset */ 

seg = FP_SEG(f); offs = FP_OFF(f); 
seg += (offs >> 4) ; offs &= 0xf ; 

return(MK_FP (seg, offs)); 

} 

/* read() that works anywhere in DOS address space for any size data, 

* works via bounce buffer. Not designed for speed or elegance. */ 
long far_read(io, base, len) int io; long len; char far *base; { 
char far *fp; 

/* normalize far pointer to handle segment rollover case */ 
fp = base = normalize(base) ; 
while (len) { 

static char dbuf[PGSIZE]; 
long rlen,tlen; 

/* bounce buffer between my data segment and ultimate dest */ 
tlen = (len > PGSIZE)? PGSIZE : len; 

if ((rlen = read (io, dbuf, tlen)) < 0) return (rlen); 

/* shoot into place */ 

movedata (_DS, (unsigned)dbuf, FP_SEG(fp), FP_OFF(fp), rlen); 

/* update transfer address and count */ 

fp = add_to_far(fp, rlen); 

len -= rlen ; 

if (tlen != rlen) break ; 

} 

return (to_long(fp) - to_long(base)); 

} 

extern far protentry(); /* known to be less than 0x200 bytes long */ 
extern far gatea20(); 





/* set up to transfer to 386 program; call protentry to do the dirty work. */ 
transfer(base, len, entry) char far *base; long len, entry; { 
unsigned seg,offs ; 
long rbase; 
char far *fp; 

/* Copy code to top of the system and execute there. This keeps it 
from getting stepped on. */ 

/* make 32 bit address */ 

rbase = to_long(base); 

fp = add_to_far(base,len); 

seg = FP_SEG(fp); offs = FP_OFF(fp); 

/* protect possible conflict of top paragraph of bss */ 
if (offs) seg++ ; 

/* force to protentry's offset so offsets agree */ 
offs = FP_OFF(&protentry) ; 

movedata (FP_SEG(&protentry), offs, seg, offs, PGSIZE); 

/* degate A20 - from now on, full physical memory address bus */ 
gatea20(); 

/* enter prot_entry program, relocated to top of loaded program, via 

intersegment return */ 

asm push word ptr rbase+2 ; 

asm push word ptr rbase ; 

asm push word ptr len+2 ; 

asm push word ptr len ; 

asm push word ptr entry+2 ; 

asm push word ptr entry ; 

asm push word ptr seg; 

asm push word ptr offs; 

asm db 0cbh; /* lret - intersegment return */ 

/* within protentry: go into 32 bit mode, copy entire system to 0 with 
single string instruction, intrasegment jump to entry point */ 
printf("protentry returned?!?\n") ; 
exit (1); 

/* NOTREACHED */ 

} 

#ifdef CKSUM 

/* 16 bit checksum of program. */ 

far_cksum(base, len) long len; char far *base; { 
char far *tmp; 
unsigned seg,offs ; 
long nbytes,sum, tlen; 
tmp = base; 
sum = 0; 
nbytes = 0; 
while (len) { 

/* normalize far pointer to handle segment rollover case */ 
tmp = normalize(tmp); 





/* Do a page at a time */ 
tlen = (len > PGSIZE)? PGSIZE : len; 
len -= tlen ; 
while (tlen--) { 

nbytes++; 
if (sum&01) 

sum = (sum>>1) + 0x8000; 
else 

sum >>= 1; 
sum += *tmp++ ; 
sum &= 0xFFFF; 

} 

} 

printf("\nChecksum %05lu%6ld ", sum, (nbytes+CLSIZE)/PGSIZE); 

} 

#endif CKSUM 
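
The far-pointer helpers in Listing One only compile on a 16-bit real-mode compiler such as Turbo C. As a rough sketch of the same arithmetic in portable C, the fragment below models a far pointer as a segment:offset pair. The struct and function names are ours, chosen for illustration; they are not part of boot.exe.

```c
#include <stdint.h>

/* Hypothetical stand-in for a Turbo C far pointer: a segment:offset pair. */
struct farptr {
    uint16_t seg;
    uint16_t off;
};

/* Linear address = segment * 16 + offset, as in to_long() of Listing One. */
uint32_t far_to_linear(struct farptr p)
{
    return (uint32_t)p.seg * 16u + p.off;
}

/* Normalized form: offset kept in 0..15, the rest carried by the segment,
   as in to_far() of Listing One. Only meaningful below 1 Mbyte. */
struct farptr linear_to_far(uint32_t linear)
{
    struct farptr p;
    p.seg = (uint16_t)(linear >> 4);
    p.off = (uint16_t)(linear & 0xf);
    return p;
}

/* Advance by n bytes and renormalize, as in add_to_far() of Listing One. */
struct farptr far_add(struct farptr p, uint32_t n)
{
    return linear_to_far(far_to_linear(p) + n);
}
```

Because many segment:offset pairs name the same byte, two far pointers can only be compared or subtracted reliably once both are in normalized form, which is why the loader renormalizes after every addition.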

[LISTING TWO] 


/* Excerpted with permission from 4.3BSD include file 

* "/usr/include/sys/exec.h" 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

* Header prepended to each a.out file. 

*/ 

struct exec { 
        long            a_magic;        /* magic number */ 
        unsigned long   a_text;         /* size of text segment */ 
        unsigned long   a_data;         /* size of initialized data */ 
        unsigned long   a_bss;          /* size of uninitialized data */ 
        unsigned long   a_syms;         /* size of symbol table */ 
        unsigned long   a_entry;        /* entry point */ 
        unsigned long   a_trsize;       /* size of text relocation */ 
        unsigned long   a_drsize;       /* size of data relocation */ 
}; 

#define OMAGIC  0407    /* old impure format */ 
#define NMAGIC  0410    /* read-only text */ 
#define ZMAGIC  0413    /* demand load format */ 
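
To make the use of this header concrete, here is a small sketch of the checks and size arithmetic that Listing One performs on it. It is written with fixed-width types so that it behaves the same on a modern host. The names exec32, exec_magic_ok, page_roundup, data_offset, and image_size are our own for this illustration and do not appear in the 4.3BSD header.

```c
#include <stdint.h>

#define PGSIZE 4096u
#define OMAGIC 0407     /* old impure format */
#define NMAGIC 0410     /* read-only text */
#define ZMAGIC 0413     /* demand load format */

/* Fixed-width mirror of struct exec from Listing Two (illustrative). */
struct exec32 {
    int32_t  a_magic;
    uint32_t a_text, a_data, a_bss, a_syms, a_entry, a_trsize, a_drsize;
};

/* Accept only the three formats that boot.exe recognizes. */
int exec_magic_ok(const struct exec32 *e)
{
    return e->a_magic == OMAGIC || e->a_magic == NMAGIC ||
           e->a_magic == ZMAGIC;
}

/* Round a byte count up to the next 386 page boundary. */
uint32_t page_roundup(uint32_t n)
{
    return (n + PGSIZE - 1) & ~(PGSIZE - 1);
}

/* Where the data segment starts in the loaded image: Listing One pads
   the text out to a page boundary only for the NMAGIC format. */
uint32_t data_offset(const struct exec32 *e)
{
    return e->a_magic == NMAGIC ? page_roundup(e->a_text) : e->a_text;
}

/* Buffer size used by Listing One: text + data + bss, plus one page of
   round-up slack and 20 Kbytes of heap. */
uint32_t image_size(const struct exec32 *e)
{
    return e->a_text + e->a_data + e->a_bss + PGSIZE + 20u * 1024u;
}
```

For example, an NMAGIC file with 5000 bytes of text has its data placed at offset 8192, while the same file in OMAGIC format has its data directly after the text at offset 5000.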

[LISTING THREE] 



title _gatea20 




; Copyright (c) 1989, 1990 William Jolitz. All rights reserved. 

; Written by William Jolitz, July 1989 

; THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR 
; IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 
; WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

; (void) gatea20(); 

; Enable Address Bit 20 that was disabled by BIOS for MSDOS 

; We need it off to use entire memory space of the AT. 

; We do this just prior to entering protected mode, never to return. 






_TEXT   segment byte public 'CODE' 
        assume  cs:_TEXT,ds:_TEXT 
_TEXT   ends 

Status_Port     equ     64h     ; 8042 Status Port 
Cmd_rdy         equ     2       ; Keyboard is ready? 
Write_outpt     equ     0d1h    ; Write next data to output port 
Port_A          equ     60h     ; 8042 Keyboard Scan and Diagnostic 
EnableA20       equ     0dfh    ; Enable Address bit 20 for use 

_TEXT   segment byte public 'CODE' 

; Wait for Keyboard controller to be ready for command 

wait42  proc    near 
chkrdy: 
        in      al, Status_Port 
        and     al, Cmd_rdy 
        jnz     chkrdy 
        ret 
wait42  endp 

; Turn on A20 again. 

_gatea20 proc   far 
        call    wait42 
        mov     al, Write_outpt 
        out     Status_Port, al 
        call    wait42 
        mov     al, EnableA20 
        out     Port_A, al 
        call    wait42 
        ret 
_gatea20 endp 

        public  _gatea20 

_TEXT   ends 

        end 

[LISTING FOUR] 

        title protentry 

; Copyright (c) 1989, 1990 William Jolitz. All rights reserved. 
; Written by William Jolitz 7/89 
; Redistribution and use in source and binary forms are freely permitted 
; provided that the above copyright notice and attribution and date of work 
; and this paragraph are duplicated in all such forms. 
; THIS SOFTWARE IS PROVIDED ''AS IS'' AND WITHOUT ANY EXPRESS OR 
; IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 
; WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

; protentry(entry,len,addr,...) long entry,len,addr; 
; Entered via jump or "ret" (e.g. no return address on stack), 
; builds necessary data structures and transfers into 32-bit 
; mode, then copies the 32-bit mode program at address "addr" 
; and byte length "len" to location 0 and enters the program 
; at location "entry". Note that both "entry" and "addr" are 
; true 32-bit absolute pointers, NOT segment:offset pairs. It 
; is assumed that both the stack and this program will not be 
; overwritten in the subsequent copy to 0 of the program to be 
; entered, so caller is responsible to place this in a location 
; above the program. 
; 
; Note that this program is position-independent (self relocating). 
; 
; Any additional args past the necessary three will be passed on the 
; stack to the entered program [note: we obviously don't provide a 
; "return" address] 


_TEXT   segment byte public 'CODE' 
        assume  cs:_TEXT,ds:nothing 
_TEXT   ends 

Data32  equ     66h     ; prefix to toggle 16/32 data operand 
JMPFAR  equ     0eah    ; opcode for JMP intersegment 

        .186            ; allow use of shl ax,cnt insn 

_TEXT   segment byte public 'CODE' 

_protentry proc far 

        jmp     short relstrt 

; Global Descriptor Table contains three descriptors: 

; 0h: Null: not used 

; 8h: Code: code segment starts at 0 and extends for 4 gbytes 

; 10h: Data: data segment starts at 0 and extends for 4 gbytes(overlays code) 

GDT: 

NullDesc dw 0,0,0,0 ; null descriptor - not used 

CodeDesc dw 0FFFFh ; limit at maximum: (bits 15:0) 

db 0,0,0 ; base at 0: (bits 23:0) 

db 10011111b ; present/priv level 0/code/conforming/readable 

db 11001111b ; page granular/default 32-bit/limit(bits 19:16) 

db 0 ; base at 0: (bits 31:24) 

DataDesc dw 0FFFFh ; limit at maximum: (bits 15:0) 

db 0,0,0 ; base at 0: (bits 23:0) 

db 10010011b ; present/priv level 0/data/expand-up/writeable 

db 11001111b ; page granular/default 32-bit/limit(bits 19:16) 

db 0 ; base at 0: (bits 31:24) 

; Load Pointers for Tables 

; contains 6-byte pointer information for: LIDT, LGDT 
; Interrupt Descriptor Table pointer 

IDTPtr dw 7FFh ; limit at maximum (allows all 256 interrupts) 

dw 0 ; base at 0: (bits 15:0) 

dw 0 ; base at 0: (bits 31:16) 

; Global Descriptor Table pointer 

GDTPtr dw 17h ; limit to three 8 byte selectors(null,code,data) 

dw offset GDT ; base address of GDT (bits 15:0) 

dw 0h ; base address of GDT (bits 31:16) 

; Constructed instruction for entry into 32 bit protected mode 
; ljmp far Note 

dispat: db Data32 ; 32-bit override prefix 

db JMPFAR ; opcode for JMP intersegment 

offl dw 0 ; starting address of 32-bit code (low-word) 

dw 0h ; starting address (high word of linear address) 

dw 8h ; CodeDesc selector=8h 

relstrt: 

cli ; disable interrupts 

; do address fixups 

mov ax,ss ; first, make a new 32 bit stack pointer! 

mov cx,4 

shl ax,cl ; ax now contains segment address low 16 bits 

mov bx,ss 

mov cx,12 

shr bx,cl ; bx now contains segment address high 16 bits 

add ax,sp 

adc bx,0 ; ax contains esp 15:0, bx contains esp 31:16 

mov si,ax ; pass new stack to 32bit mode via si & di 

mov di,bx 

mov ax,cs 

mov cx,4 

shl ax,cl ; ax now contains segment address low 16 bits 

mov bx,cs 

mov cx,12 

shr bx,cl ; bx now contains segment address high 16 bits 

mov cx,cs:GDTPtr+2 

mov dx,bx 

add cx,ax 

mov cs:GDTPtr+2,cx 

adc cs:GDTPtr+4,dx 

mov cx, OFFSET(cpydwn) 

mov dx,bx 

add cx,ax 

mov cs:offl,cx ; overflow? 

adc cs:offl+2,dx 

; Load the descriptor tables 

; lidt cs:IDTPtr ; load Interrupt Descriptor Table 

db 2eh,0Fh,01h,00011110b 
dw offset IDTPtr 

; lgdt cs:GDTPtr ; load Global Descriptor Table 

db 2Eh,0Fh,01h,00010110b 
dw offset GDTPtr 

; smsw ax ; put Machine Status Word in AX 

db 0fh, 01h, 11100000b 

or al,1 ; activate Protection Enable bit 

; lmsw ax ; store Machine Status Word, begin protected mode 

db 0fh,01h,11110000b 

jmp short Next ; flush prefetch queue 

; Load the segment registers with appropriate descriptor selectors 

Next: mov bx,10h ; set segment registers to DataDesc 

mov ss,bx ; load SS,DS,ES segment registers with DataDesc 


        mov     ds,bx 
        mov     es,bx 

; Load CS via above's constructed ljmp, entering 32 bit protected mode 
        jmp     short dispat 

; Finally running in Protected 32-bit Mode 

cpydwn: 
        mov     ax,di           ; movl %edi,%eax 
        db      0c1h,0e0h,10h   ; shl ax,16 ; shll $16,%eax 
        db      Data32 
        mov     ax,si           ; movw %si,%ax 
        mov     sp,ax           ; movl %eax,%esp 
        pop     ax              ; pop eax ; entry addr 
        pop     cx              ; pop ecx ; byte size 
        pop     si              ; pop esi ; source address 
        xor     di,di           ; xor edi,edi ; destination address 
        cld 
        rep     movsb           ; copy into place 
        mov     sp,si           ; movl esp,esi 
        jmp     ax              ; jmp eax ; go to entry 
_protentry endp 

        public  _protentry 

_TEXT   ends 
        end 
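
The two descriptors in Listing Four are written out byte by byte. As a hedged illustration of how those bytes come about, the C sketch below packs a base, a 20-bit limit, an access byte, and the granularity/size flags into the eight-byte 386 descriptor layout. The function names are ours and are not part of protentry; the field positions follow the comments in the listing.

```c
#include <stdint.h>

/* Pack one 386 segment descriptor into d[0..7].
   limit : 20-bit segment limit
   access: present/privilege/type byte (e.g. 10011111b for CodeDesc)
   flags : high nibble of byte 6 (0xC = page granular, default 32-bit) */
void make_descriptor(uint8_t d[8], uint32_t base, uint32_t limit,
                     uint8_t access, uint8_t flags)
{
    d[0] = (uint8_t)(limit & 0xff);          /* limit bits 7:0   */
    d[1] = (uint8_t)((limit >> 8) & 0xff);   /* limit bits 15:8  */
    d[2] = (uint8_t)(base & 0xff);           /* base bits 7:0    */
    d[3] = (uint8_t)((base >> 8) & 0xff);    /* base bits 15:8   */
    d[4] = (uint8_t)((base >> 16) & 0xff);   /* base bits 23:16  */
    d[5] = access;
    d[6] = (uint8_t)(((flags & 0x0f) << 4) | ((limit >> 16) & 0x0f));
    d[7] = (uint8_t)((base >> 24) & 0xff);   /* base bits 31:24  */
}

/* Selector for GDT entry `index` with table indicator and RPL of 0. */
uint16_t gdt_selector(unsigned index)
{
    return (uint16_t)(index << 3);
}

/* Limit field for a table of n eight-byte descriptors. */
uint16_t gdt_limit(unsigned n)
{
    return (uint16_t)(n * 8 - 1);
}
```

With a base of 0, a limit of 0xFFFFF pages, and flags of 0xC, this reproduces the CodeDesc and DataDesc bytes above. The selectors 8h and 10h are simply entries 1 and 2 of the table, and the 17h in GDTPtr is the limit of a three-entry table.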


[LISTING FIVE] 

/* Copyright (c) 1989, 1990 William Jolitz. All rights reserved. 

* Written by William Jolitz 7/89 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS'' AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

* This program copies a BSD filesystem out of an MSDOS file and 

* places it on an pre-reserved disk partition. Note that both the 

* geometry of the particular disk, and the particulars of the 

* BSD partition need to be adjusted to suit the drive on which this will 

* be used. Normally, this would be a very rude requirement, but 

* we tolerate this because this program is a throw-away used to get 

* us started, and we have better schemes to deal with configuration 

* a little further down the pike. 

* Currently works with TURBO C 1.5 . 

*/ 

#include 
#include 
#include 
#include 
#include "diskl.h" 

/* Disk geometry (here, a NEC DS5146). Adjust parameters to suit drive. */ 

#define NCYL    615 
#define NTRACK  8 
#define NSECT   17 
#define BSIZE   512     /* Disk block size */ 

/* Location & size of root partition. Adjust for drive partition layout. */ 
#define OFF_CYL 290 /* Cylinder offset of start of BSD root partition */ 

#define ROOTSZ 50 /* size of root partition, in units of cylinders */ 

char trkbuf[NSECT*BSIZE]; 

struct label_blk { 

char bufr[LABELOFFSET]; 
struct disklabel dl; 

} lbl; 

main (argc, argv) char *argv[]; { 

int fi, rem, cyl, head, sector, tfrcnt; 
if (argc != 2) { 

printf ("usage: cpfs \n"); 
exit (1) ; 

} 

fi = open (argv[1],O_BINARY); 
if (fi < 0) { 

printf ("Cannot open \"%s\" file to read filesystem\n", 
argv[1]); 
exit (1) ; 

} 

cyl = OFF_CYL; 
tfrcnt = head = 0; 

#ifndef FIRST 

/* check for presence of disklabel */ 

biosdisk (2, 0x80, 0, OFF_CYL, LABELSECTOR, 1, &lbl); 
if (lbl.dl.dk_magic != DISKMAGIC) { 

printf ("BSD Disk partition does not have a label!\n"); 
exit (1) ; 

} 

/* Treat first track of data special; use disk label in first block of 

* file to validate that the file to be loaded and disk drive 

* partition are appropriate for each other. */ 
read (fi, trkbuf, BSIZE); 

if (strncmp (trkbuf, &lbl, BSIZE) != 0) { 

printf ("BSD root partition and filesystem mismatch!\n"); 
exit (1) ; 

} 

/* reset filesystem file to beginning */ 
lseek (fi, 0, SEEK_SET); 

#endif 

printf ("WARNING! About to overwrite disk (will loose previous\n"); 
printf ("contents). Are you certain of your use of this program?"); 
if (getche () != 'y') exit (1); 

printf("\n"); 

/* Transfer file to absolute disk section, a track at a time, 
because we're impatient. */ 

while ((rem = read (fi, trkbuf, NSECT*BSIZE)) == NSECT*BSIZE) { 
biosdisk (3, 0x80, head, cyl, 1, NSECT, trkbuf); 
if (++head == NTRACK) { 

head = 0; 

if (++cyl > NCYL || cyl > OFF_CYL+ROOTSZ ) { 

printf ("Overran root partition!\n"); 
exit (1) ; 

} 

} 

tfrcnt += NSECT; 

printf ("Amount transferred %5dK bytes\r", 
tfrcnt*BSIZE/1024); 

} 

/* Transfer any remainder leftover in track buffer. */ 
if (rem > BSIZE-1) { 

biosdisk (3, 0x80, head, cyl, 1, rem/BSIZE, trkbuf) ; 
tfrcnt += rem/BSIZE; 

printf ("Amount transferred %5dK bytes\n", 
tfrcnt*BSIZE/1024) ; 

} 

exit (0); 

} 
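
The copy loop in Listing Five steps through the partition one track at a time, bumping the head and rolling over into the next cylinder. The same bookkeeping can be sketched in closed form, as below. The geometry constants are copied from the listing; the struct and function names are ours, for illustration only.

```c
#define NTRACK  8       /* heads (tracks per cylinder) */
#define NSECT   17      /* sectors per track */
#define BSIZE   512     /* bytes per sector */
#define OFF_CYL 290     /* first cylinder of the root partition */
#define ROOTSZ  50      /* root partition size in cylinders */

struct trackpos {
    int cyl;
    int head;
};

/* Cylinder and head of the n-th track (counting from 0) of the root
   partition: the closed form of the ++head / ++cyl stepping. */
struct trackpos track_position(long n)
{
    struct trackpos p;
    p.cyl  = OFF_CYL + (int)(n / NTRACK);
    p.head = (int)(n % NTRACK);
    return p;
}

/* Kbytes moved after `tracks` whole tracks, computed in long arithmetic. */
long kbytes_written(long tracks)
{
    return tracks * NSECT * (long)BSIZE / 1024;
}

/* Total capacity of the root partition in bytes. */
long root_capacity_bytes(void)
{
    return (long)ROOTSZ * NTRACK * NSECT * BSIZE;
}
```

The sketch uses long for the byte counts on purpose. With a 16-bit int, as on Turbo C, a product such as a sector count times BSIZE exceeds the range of int after only 64 sectors, so a progress figure computed in int can be wrong even when the data copied is correct.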

[LISTING SIX] 

/* Copyright (c) 1985,1986,1989,1990 Michael J. Karels. All rights reserved. 

* Based on a concept by Sam Leffler. Written by Michael J. Karels 4/85 

* Revised by William Jolitz 86-90. 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS'' AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

* Each disk has a label which includes information about the hardware 

* disk geometry, filesystem partitions, and drive specific information. 

* The label is in block 1, offset from the beginning to leave room 

* for a bootstrap, etc. 

*/ 

#define LABELSECTOR 1 /* sector containing label */ 
#define LABELOFFSET (BSIZE-120) /* offset of label in sector */ 
#define DISKMAGIC 0xabc /* The disk magic number */ 
#define DTYPE_ST506 1 /* ST506 Winchester */ 
#define DTYPE_FLOPPY 2 /* 5-1/4" minifloppy */ 
#define DTYPE_SCSI 3 /* SCSI Direct Access Device */ 

struct disklabel { 
        short   dk_magic;               /* the magic number */ 
        short   dk_type;                /* drive type */ 
        struct dcon { 
                short   dc_secsize;     /* # of bytes per sector */ 
                short   dc_nsectors;    /* # of sectors per track */ 
                short   dc_ntracks;     /* # of tracks per cylinder */ 
                short   dc_ncylinders;  /* # of cylinders per unit */ 
                long    dc_secpercyl;   /* # of sectors per cylinder */ 
                long    dc_secperunit;  /* # of sectors per unit */ 
                long    dc_drivedata[4]; /* drive-type specific information */ 
        } dc; 
        struct dpart {                  /* the partition table */ 
                long    nblocks;        /* number of sectors in partition */ 
                long    cyloff;         /* starting cylinder for partition */ 
        } dk_partition[8]; 
        char    dk_name[16];            /* pack identifier */ 
}; 

#define dk_secsize      dc.dc_secsize 
#define dk_nsectors     dc.dc_nsectors 
#define dk_ntracks      dc.dc_ntracks 
#define dk_ncylinders   dc.dc_ncylinders 
#define dk_secpercyl    dc.dc_secpercyl 
#define dk_secperunit   dc.dc_secperunit 

/* Drive data for ST506. */ 
#define dk_precompcyl   dc.dc_drivedata[0] 
#define dk_ecc          dc.dc_drivedata[1] /* used only when formatting */ 
#define dk_gap3         dc.dc_drivedata[2] /* used only when formatting */ 

/* Drive data for SCSI */ 
#define dk_blind        dc.dc_drivedata[0] /* can we work in "blind" i/o */ 


[LISTING SEVEN] 

/* Copyright (c) 1989, 1990 William Jolitz. All rights reserved. 

* Written by William Jolitz 7/89 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS'' AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

* This program copies a MSDOS file to BSD's idea of a swap partition, 

* known to be the second one in the disklabel. Typical use is to place 

* a TAR formatted file, obtained from a cross-host, onto swap. Then 

* BSD is booted with the boot program and the BSD tar utility is 

* used to extract the files being transferred within the TAR image, 

* hopefully before we need to page on the swap space. Again, this 

* program is rude in requiring one to adjust the manifest constant 

* denoting the cylinder on which the BSD root filesystem appears, 

* but this is another throw-away program to get the real work started. 

* Currently works with TURBO C 1.5 . 

*/ 

#include 
#include 
#include 
#include 
#include "diskl.h" 

/* Location of root partition. Adjust to suit given drive partition layout. */ 
#define OFF_CYL 290 /* Cylinder offset of start of BSD root partition */ 

char *trkbuf; 

#define BSIZE 512 
struct label_blk { 

char bufr[LABELOFFSET] ; 
struct disklabel dl; 

} lbl; 





main(argc, argv) char *argv[]; { 
int fi, rem, cyl, head, tfrcnt; 

int bsize, ncyl, ntrack, nsect, off_cyl, maxcyl; 

if (argc != 2) { 

printf("usage: cpsw \n"); 
exit (1) ; 

} 

fi = open(argv[1], O_BINARY); 
if (fi < 0) { 

printf("Cannot open \"%s\" file to BSD swap\n", 
argv[1]); 
exit (1) ; 

} 

/* check for presence of disklabel */ 

biosdisk(2, 0x80, 0, OFF_CYL, LABELSECTOR, 1, &lbl); 

if (lbl.dl.dk_magic != 0xabc) { 

printf("BSD root disk partition does not have a label!\n"); 
exit (1) ; 

} 


/* Extract disk geometry and swap partition location from disk label. */ 
bsize = lbl.dl.dk_secsize; 
nsect = lbl.dl.dk_nsectors; 
ntrack = lbl.dl.dk_ntracks; 
off_cyl = lbl.dl.dk_partition[1].cyloff; 
maxcyl = lbl.dl.dk_partition[1].cyloff + 

lbl.dl.dk_partition[1].nblocks / lbl.dl.dk_secpercyl; 

/* Allocate track buffer */ 
trkbuf = malloc (nsect*bsize); 

printf("WARNING! About to overwrite disk (will loose previous\n"); 
printf ("contents) . Are you certain of your use of this program?"); 
if (getche() != 'y') exit(1); 
printf("\n"); 

tfrcnt = head = 0; 
cyl = off_cyl; 

/* Transfer file to absolute disk section, a track at a time, 
because we're impatient. */ 

while ((rem = read(fi, trkbuf, nsect*bsize)) == nsect*bsize) { 
biosdisk(3, 0x80, head, cyl, 1, nsect, trkbuf); 
if (++head == ntrack) { 
head = 0; 

if (++cyl > maxcyl) { 

printf("Overran swap partition!\n"); 
exit (1) ; 

} 

} 

tfrcnt += nsect; 

printf("Amount transferred %5dK bytes\r", 
tfrcnt*BSIZE/1024); 

} 

/* Transfer any remainder leftover */ 
if (rem > BSIZE-1) { 

biosdisk(3, 0x80, head, cyl, 1, rem/bsize, trkbuf); 
tfrcnt += rem/bsize; 

printf("Amount transferred %5dK bytes\n", 
tfrcnt*bsize/1024); 

} 

exit (0); 

} 

Porting Unix To The 386: The Standalone System 

Using the protected mode program loader, a minimal 80386 protected mode standalone C programming 
environment for operating systems kernel development is created. 

Creating a protected-mode standalone C programming environment 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9 BSD and the chief architect of National Semiconductor’s 
GENIX. Lynne established TeleMuse, a market research firm specializing in the telecommunications and 
electronics industry. They can be contacted via e-mail at william@berkeley.edu or at uunet!william. 
Copyright (c) 1990 TeleMuse. 

This is the third article in this series, and at this point we feel it is important to examine just what we have 
accomplished so far. In our first article, we arrived at essentially a "plan of action," outlining what we 
understand to be the important goals of our project, as well as discussing (as always, in hindsight) some of 
the important technical decisions made in the process of completing our successful port of BSD to the 
80386. In the second article, we wrote three programs (using Turbo C and MASM) to prepare our host for 
the beginnings of this port by creating the basic tools. We are now at the point of departure, where the goal 
itself can become all-consuming. 

"Why all the drama?" one may ask. Well, what we are about to embark on may be considered, in a sense, 
like a mountaineering expedition to K2. We have done all the scheduling and planning and assembled the 
consumables and equipment. Now, here we sit at the base of the mountain, staring up at its intimidating 
peak and contemplating our first steps with both anticipation and dread. Projects of great complexity are 
always uncertain. 

In this case, our mountain is an empty 386 residing in protected mode. There is not one shred of code that 
we can rely on. One false step can cause a spontaneous reset, or worse yet, a hang. Please believe us when 
we say that it takes a lot of courage to take on such projects. Now one must shrug one’s shoulders of any 
uncertainty and begin to place one foot in front of the other and scale the foothills. We must establish our 
base camp from which we can explore further. 





In this article, we endeavor to scale those foothills and establish our base camp by building upon our 
previous work; using our protected-mode program loader, we can create a minimal 80386 protected-mode 
standalone C programming environment for operating systems kernel development work. Next, we must 
write prototype code for various kernel hardware support facilities. Finally, we must use our standalone 
programming environment as a test bed to shake out the bugs in our first-cut implementation of kernel 386 
machine-dependent code in preparation for incorporation in the BSD kernel. 

At this point, the neophyte tends to ask the question (and it is a good question, mind you) "Why spend so 
much time on small programs, prototype code, and the like, instead of getting into the hard stuff?" Yes, it 
does appear on the surface that one should start shoveling this huge operating system through the compiler 
and onto our host. (At this very instant of writing, the BSD kernel consists of 128,332 lines, according to 
wc -l, and supports roughly 150 Mbytes of user-level sources; sorry, can’t wait to consult wc on that!) 

Besides being a bit of a bore, just what would happen if we jumped into compiling code willy-nilly? In 
sum, it would be a complete disaster, as we would spread all of our latent bugs and misconceptions over a 
much broader body of code. Worse yet, all these different bugs would be well-distributed throughout our 
code and hence not easily differentiated or ordered. 

A simple beginning affords us the chance to find various bugs early, when the problem still has a chance 
of being resolved to our satisfaction. We have plenty of land mines to avoid as it is, without adding to our 
troubles. 

Watching for Land Mines 

Porting 386BSD was definitely an eerie return to the basics; using the protected-mode loader, we 
bulldozed MS-DOS right off the top and were left with an empty machine that we incrementally built up 
to a functioning UNIX system. At this stage in the port, an unrealizable or nondeterministic bug is a very 
real and costly possibility that can stall a porting project for months. 

At this point, a crafty programmer can use subtle techniques which anticipate the sources of error and 
enable them to be identified and corrected in a predictable and orderly manner. In a famous discussion, 
Donald E. Knuth wrote of how he was able to greatly reduce the time it took to debug a new compiler by 
anticipating worst-case test conditions and "stress testing" by use of the adversary method. We employ 
similar techniques by testing the mechanisms used in the kernel separately from the vast body of code that 
is the kernel. This also has the added benefit of differentiating problems in the code from compiler and 
assembler bugs that are almost certain to be present. This is not a method that guarantees success, but it is 
much better to seek out trouble rather than wait for it to find you. 

Serious thought was given to implementing a tiny debugger to facilitate this stage of the port. PC 
debuggers were also examined as a possible tool to ease this effort. However, it was determined that the 
effort to keep the tools concurrent with the rest of the project would be too costly for too little advantage. 
This proved, in retrospect, to be the correct strategy for 386BSD, since most of the bugs we encountered 
were either inherent to our naive assumptions regarding the 386 instruction set or to silent "features" of the 
particular version of the GNU GAS assembler used, and as such would have affected the debugger as well 
as the operating system kernel we were porting. However, an appropriate debugging tool might have cut 
our development time and would have been especially useful if we were contemplating many ports to 
other architectures. 





In practice, an appropriate tool for kernel debugging should afford little impact on the absolute 
environment. It should allow for source-level debugging (now considered a bare minimum requirement) 
and should leverage existing development platforms as much as possible. Van Jacobson’s "kernel GDB," 
for example, is derived from GNU’s GDB debugger. It uses a small stub routine in the kernel and a serial 
line to communicate with a cross-host running the debugger. Other debuggers probably exist that exhibit 
these qualities, but we are unaware of them. 

With this article, our port is moved from the conceptual to the tangible world. This discussion, while by no 
means complete, addresses many of the mechanisms necessary for 386BSD kernel functionality. It is 
tantalizing to watch as the basic mechanisms come together, and one tends to think of what remains as 
mere bookkeeping. If this were true, operating systems programming techniques would more closely 
resemble those used by applications programmers. They do not, however. While it may appear that the end 
of the project is nearing, what is actually occurring is merely the first battle of a long and involved war. 
Again, to use our mountain climbing analogy, we clear the foothills only to be faced with the mountain 
range itself. In other words, we are continually challenged with a new class of obstacles. 

The First Step 

At this point, there is little confidence in any of our tools because we have yet to actually "shake down" 
the absolute loader, assembler, and link editor. Beginning with trivial programs of a few instructions and 
gradually expanding them, we incrementally prove our tools to the point where we can use them with 
some degree of confidence. The journey begins, as always, with a single step. 




The program in Listing One is the simplest protected-mode program we can write that generates output on 
the screen. It displays the message "hi" midscreen and then stops. A program this simple must always 
work. If it does not, it presents a minimum number of possibilities to determine why it fails. During our 
port, this program originally did not work, due to bugs in the early loader and assembler. While this may 
seem trite to some, this program illustrates the pathetic level at which untested software tools begin. After 
eliminating a handful of nuisance bugs, this simple program did work, and it proved valuable because it 
was able to smoke out bugs quickly. 
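
For readers who think more readily in C than in assembler, the idea behind such a program can be sketched as follows. This is not the program in Listing One; it is a minimal illustration that assumes an 80-column by 25-row text display whose buffer holds one 16-bit cell per character (character code in the low byte, attribute in the high byte), located at physical address 0xB8000 on a color adapter. The function names are hypothetical. The buffer is passed in as a pointer so that the same code can be exercised against an ordinary array on any host.

```c
#include <stdint.h>

#define COLS 80
#define ROWS 25

/* Store one character and its attribute into the cell at (row, col).
   `screen` is the start of the text buffer: 0xB8000 on a color adapter
   (our assumption), or any array of ROWS*COLS cells when testing. */
void put_cell(volatile uint16_t *screen, int row, int col,
              char c, uint8_t attr)
{
    screen[row * COLS + col] = (uint16_t)((attr << 8) | (uint8_t)c);
}

/* Display a short string starting at the middle of the screen,
   using attribute 0x07 (light grey on black). */
void show_midscreen(volatile uint16_t *screen, const char *s)
{
    int col = COLS / 2;

    while (*s && col < COLS)
        put_cell(screen, ROWS / 2, col++, *s++, 0x07);
}
```

On the bare machine, a call along the lines of show_midscreen((volatile uint16_t *)0xB8000, "hi") would be the whole program, followed by an idle loop. On a development host, passing an array instead lets the cell arithmetic be checked before any hardware is involved.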


A side note to those who may have noticed that our assembly code format seems to have changed since the 
previous article when we used Microsoft’s MASM: For those unaware, UNIX 386 assemblers prefer the 
operands in the opposite order, partly because early UNIX systems appeared on PDP-11s, which preferred 
this ordering style. Thus, on MS-DOS with MASM: 

mov eax,edx /move contents of edx into eax register 


corresponds to the UNIX assembler format: 

movl %edx,%eax /move contents of edx into eax register 

In other words, MASM writes (destination, source), where the UNIX assembler writes (source, destination). This is yet another stunning 
"improvement" in the field of computer languages, destined to be appreciated by those simultaneously 
debugging a MASM-coded bootstrap loader and code generated by the GAS UNIX assembler! 






As we proceed further, we add more complexity, testing span-dependent jumps, stacks, and other
mechanisms. Listing Two is a more elaborate program that provides character and string output functions
for the screen, thus allowing for a primitive degree of debugging. Listing Three contains a simple runtime
start-off for C, with the obligatory "hello world" program heralding our arrival into serious programming
mode. At this point, we’ve found most of the silly bugs and also created a primitive debugging tool. One
might even claim that, through this method, our entire BSD UNIX system is derived from the original
two-instruction program that we started with!

Introducing the Standalone System 

The next milestone on our path was to produce, debug, and test a library of support routines written in 
absolute protected-mode code. These routines allow us to write the GCC programs needed to implement 
386 machine-dependent code, to access devices, and to access UNIX file structures on the hard disk. For 
the code in Listing Three to function, a library is required to fill out all of the primitives invoked. 

This library and its corresponding programs constitute a standalone system of a kind, and it affords us an
opportunity to write a minimal amount of machine-dependent code outlining our basic structure before we 
commit to massive coding. It is a minimal C environment at best, but more than enough for us to 
implement and test things like exception catching, system call handling, line clock interrupts, and so forth. 
As we begin our climb, we are able to expand a toe hold into a foot hold. 

The standalone system actually consists of assembly language programs for runtime start-off and 
processor support (module srt.s), as well as machine-dependent C code for device support (many modules, 
including kbd.c and cga.c) and machine-independent C code for language support, formatted output, and 
filesystem operations (modules prf.c and sys.c). With the standalone system, a file can be read or written 
from a BSD filesystem on a disk drive. 

The BSD standalone system is not quite intended for this purpose; instead, it’s used to bootstrap load the
system from disk or tape as part of the process of initializing the computer to run the BSD system. Since
we don’t yet have an operable kernel to be loaded and we’ve already written an MS-DOS program loader
(see DDJ, February 1991), the standalone system is not really of use to us yet. However, the standalone 
system also provides us with file I/O, formatted output, and a structure to hang hardware drivers on, while 
demanding little from the hardware for support. Thus, we can use the standalone system to prototype code 
for the kernel, with the added dividend of completing the bootstrap code required by the complete kernel. 

To run this minimal system, only the simplest of keyboard, display, and hard disk device drivers are 
required. These can be enhanced later as needed. 

Keyboard Driver 

Listing Four outlines code for a simple driver, which extracts ASCII characters from the PC keyboard on
demand by grabbing scan codes from the 8042 keyboard interface, consulting a table of actions for the
given key-press scan code, and returning the appropriate value out of the key table. It does not even
bother to initialize the keyboard controller, since we know that MS-DOS already did that for us before we
loaded the program with our absolute loader.
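
The table-driven lookup just described can be sketched in C as follows. This is only an illustration under
stated assumptions, not the code of Listing Four: the tables are cut down to three keys, the names
demo_action, demo_unshift, demo_shift, and kbd_decode() are hypothetical, and we assume the usual PC
convention that a key release repeats the scan code with its high bit set.

```c
#include <assert.h>

/* Action flags, named as in Listing Four. */
#define SHF   0x002     /* keyboard shift */
#define ASCII 0x040     /* key yields an ASCII code */

#define NKEYS 43        /* hypothetical: demo tables stop after scan code 42 */

/* Hypothetical cut-down tables, indexed by scan code. */
static const unsigned char demo_action[NKEYS]  = { [1] = ASCII, [2] = ASCII, [42] = SHF };
static const unsigned char demo_unshift[NKEYS] = { [1] = 033, [2] = '1' };
static const unsigned char demo_shift[NKEYS]   = { [1] = 033, [2] = '!' };

static int shift_down;  /* nonzero while the shift key is held */

/* Translate one raw scan code. Returns an ASCII value, or -1 when the
 * code only changes shift state, is a key release, or is unknown. */
int kbd_decode(unsigned char code)
{
    int released = (code & 0x80) != 0;  /* assumed: high bit marks a release */
    unsigned char key = code & 0x7f;

    assert(NKEYS <= 128);               /* table must fit the 7-bit key range */
    if (key >= NKEYS)
        return -1;
    if (demo_action[key] & SHF) {       /* shift keys only update state */
        shift_down = !released;
        return -1;
    }
    if (released || !(demo_action[key] & ASCII))
        return -1;
    return shift_down ? demo_shift[key] : demo_unshift[key];
}
```

Feeding this routine scan code 2 yields '1'; after scan code 42 (shift pressed) the same key yields '!'. A
real driver would obtain each code by reading the keyboard interface rather than taking it as an argument.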







This is the first place where we are hit with variations in PC keyboard interfaces, all of which are hidden 
from applications programs and MS-DOS by appropriate BIOS ROM drivers. It is possible to dance back 
and forth between real and protected mode (thankfully made easier on the 386 than was the case on the 
286), "translating" the BIOS calls into BSD driver requests. This method was intended for the PC, if one 
examines its real-mode ancestry, and also addresses a nest of manufacturer idiosyncrasies. However, it 
goes against the grain of our project in three basic ways: 1. performance degrades in getting away from 
direct interaction with the hardware; 2. incompatibility with previous BSD systems develops; and 3. 
implementation becomes a bigger project than the port itself. In addition, it perpetuates the intertwining of 
MS-DOS and UNIX to the point where it becomes a significant future liability. To resolve this dilemma, 
we must choose to support the "raw" machine in its entirety, with the result that undocumented or "secret" 
proprietary hardware features must be ignored. This is not as great a burden as it may first appear, because 
a considerable body of code already exists for this purpose, and the great bulk of 386 AT platforms 
already conform to de facto hardware standards. 

Display Adapter 

In Listing Five, we can examine the code from a trivial "glass tty" terminal emulator for the display,
which in this case happens to be a CGA board. We can be content at the moment with newline, carriage 
return, and tab functions, since we do not intend to do anything other than line-oriented text output in the 
standalone system. Scrolling, by far, is the most complicated feature. 
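
To make the "glass tty" behavior concrete, here is a minimal sketch in C. It is not the code of Listing Five:
so that it can run anywhere, an ordinary array stands in for the display memory, and the names screen,
tty_putc(), scroll_up(), and newline(), as well as the wrap at the right margin, are our own assumptions.
The attribute value is the one used in Listing Two.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COLS 80
#define ROWS 25
#define ATTR 0x0e00u    /* attribute byte, as used in Listing Two */

/* Stand-in for the CGA text buffer, so the sketch runs hosted. */
static uint16_t screen[ROWS * COLS];
static int row, col;    /* current cursor position */

static void scroll_up(void)
{
    /* Move rows 1..24 up by one row, then blank the bottom row. */
    memmove(screen, screen + COLS, (ROWS - 1) * COLS * sizeof screen[0]);
    for (int i = 0; i < COLS; i++)
        screen[(ROWS - 1) * COLS + i] = ATTR | ' ';
}

static void newline(void)
{
    if (row == ROWS - 1)
        scroll_up();    /* already on the last row: scroll instead */
    else
        row++;
}

void tty_putc(int c)
{
    assert(row >= 0 && row < ROWS && col >= 0 && col < COLS);
    switch (c) {
    case '\r':
        col = 0;
        break;
    case '\n':
        newline();
        break;
    case '\t':
        col = (col + 8) & ~7;           /* next multiple of eight */
        break;
    default:
        screen[row * COLS + col] = (uint16_t)(ATTR | (uint8_t)c);
        col++;
        break;
    }
    if (col >= COLS) {                  /* assumed: wrap at right margin */
        col = 0;
        newline();
    }
}
```

Scrolling is indeed the only part with any bulk: everything else is cursor arithmetic. On the real
hardware the array would be replaced by a pointer to the adapter's video RAM.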

Our decision to avoid the BIOS at this point does make things more difficult, because the BIOS 
automatically configures in device-dependent code from ROMs onboard the display card to support the 
given device. Fortunately, market forces have kept the proliferation of variations down to a reasonable 
number, with either MDA or CGA interface standards supported by practically all boards. Up to the point 
where we must support X Windows, we can live with probing to determine the display type and "hard 
coding" for each. 

Prevaricating with the Standalone System 

The standalone system also provides us with a test bed for trying many different ideas which can satisfy 
the mechanisms used in the BSD operating system kernel, for we can then selectively test these 
mechanisms individually. Otherwise, we would be forced to test them all together within the operating 
system. Thus, as we vary our approach, we can determine whether each method satisfies our basic 
specification conditions and whether implementation is feasible for our project. Over the course of this 
project, the support strategies for device-interrupts configuration and process context-switching changed 
drastically as we began to notice the degree of difference between porting BSD to a 386 and porting BSD 
to more "conventional" architectures. In fact, we were still using the standalone system to find unintended 
interactions in our 386 hardware-features support code long after we had 386BSD self-compiling. Another 
valuable aspect of this test bed is we can benchmark competitive solutions to the same kernel support 
mechanism sans other interactions. This was useful in selecting appropriate context switch, interrupt 
control, and virtual memory system code. 






Extending the Standalone System 

On top of the standalone system framework (which really requires very little processor-dependent support) 
we can write and test portions of code for the operating system kernel (which requires quite a lot of 
processor-dependent support). In the following sections of this article, we will discuss some extensions to 
the standalone system which add kernel functionality. Processor support for the kernel reflects support for 
memory protection of 386 "rings," ring crossings, and address space translations among other needs (see 
the accompanying box "Brief Notes: 386 Rings"). 

These extensions are not required for the standalone system to function. They are used to test
the kernel code, and they also form the basis for the prototype kernel code. In essence, the standalone
system can be viewed as if it were the kernel itself, or possibly even a nano-kernel!

Processor Support — i386.c 

Within the i386.c module appears the code and data structures needed to "wire-down" most of the 386’s 
processor structures (descriptors, exceptions, task switch state). init386() is a subroutine that "fills in the 
blanks" and test386() tests portions of the mechanisms we will need to run our BSD UNIX system. Note 
that this creates a superficial test bed that does not entirely address our intended system, as user and kernel 
mode not only share address space, they are the same program! 

We start first by initializing paging (Listing Nine). The next fragment contains code which enables
paging by building a set of page tables and page directory. For this example, we map virtual addresses to 
correspond with physical addresses identically, and allow the first 4 Mbytes of physical memory to be 
referenced "read/write" by both user and supervisor (kernel) rings. It is important to remember that while 
the processor’s instructions work through the paging MMU with virtual addresses, the addresses that the 
MMU uses to consult page directory and page tables are all physical addresses. These physical addresses 
do not always correspond to the virtual addresses that the processor uses, unlike this example where 
virtual addresses are mapped one for one. As a result, when modifying the page tables and page directory 
the kernel must explicitly convert any virtual addresses used to physical. 
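
As a rough illustration of the identity mapping described above, the following C sketch fills in one page
directory and one page table. It is not the code of Listing Nine: the function name build_identity_map(),
the macro names, and the pt_phys parameter (the physical address of the page table, which the caller is
assumed to know) are all hypothetical. The bit positions are those of the 386 page table entry format.

```c
#include <assert.h>
#include <stdint.h>

#define NPTE     1024u  /* entries per page table or page directory */
#define PAGE_SZ  4096u  /* bytes per page */
#define PG_V     0x001u /* entry is present (valid) */
#define PG_RW    0x002u /* writable */
#define PG_U     0x004u /* accessible from the user ring */

/* Fill one page table and one page directory so that the first
 * 4 Mbytes map virtual == physical, read/write for both rings.
 * pt_phys must be the PHYSICAL address of ptab. */
void build_identity_map(uint32_t pdir[NPTE], uint32_t ptab[NPTE],
                        uint32_t pt_phys)
{
    assert((pt_phys & (PAGE_SZ - 1)) == 0);     /* tables are page aligned */

    for (uint32_t i = 0; i < NPTE; i++) {
        ptab[i] = (i * PAGE_SZ) | PG_V | PG_RW | PG_U;  /* page i -> frame i */
        pdir[i] = 0;                                    /* not present */
    }
    /* Directory entry 0 covers virtual addresses 0 through 4 Mbytes - 1. */
    pdir[0] = pt_phys | PG_V | PG_RW | PG_U;
}
```

Note that the directory entry holds pt_phys, not the address of ptab as the program sees it. In this
identity-mapped example the two happen to be equal, but the distinction matters as soon as they differ.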

Another point to mention about this paging mechanism is that the page tables and page directory 
themselves need to be mapped to a given virtual address so that the kernel may modify them to change 
address translation on demand. An oddity of this paging mechanism is that it can work even if the page 
tables are completely inaccessible to the kernel in its virtual address space. This would be inconvenient for 
the kernel, however, as it spends a great deal of time modifying these structures already. 

Two assembly language helper routines, lcr0() and lcr3(), allow us to set the 386’s processor control
register and page directory base register, respectively. Since we are already running "protected," the
lcr0() call simply rewrites the already-set protect-mode bit while also setting the paging-mode bit,
allowing the MMU to enter paging mode.

Our page tables and directory as encoded here provide a null address mapping, so that there is, as yet, no 
effective difference in address translation. One might wonder why we must do this. If we don’t, several 
subtle problems arise. For example, if the address mapping of the instructions we are executing were to 
differ, the 386’s view of which instruction was to be executed next might no longer match the next 
assembled instruction the program should have executed. Both must be changed synchronously. Worse, if 
the 386 has an instruction queue fetching asynchronously, we may not be able to predict exactly when the
transition occurred. The safest way to avoid these problems is to enable page mapping with no net
translation, then modify the address mapping after the processor is running on the "identity" map. We can 
then arrange to flush our various processor instruction queues and MMU address translation buffers before 
allowing the processor to execute instructions in a "translated" portion of the address space. 

Besides paging, we must reinitialize segmentation. We start by "flattening" the 386 with our descriptor
tables. On the 386 (see Listing Six), our Global Descriptor Table (GDT) describes address space selectors
that will have global visibility within our BSD kernel such that all processes will see them. Kernel address
space requires a descriptor for instructions and data, as well as a task gate used to switch processes
through, and various task state descriptors used to save and restore state on demand. The kernel has a
"panic" task state reserved to be used when catching certain exceptions that require a "known good" task
state.

For the address space selectors used in user processes, we have the Local Descriptor Table (LDT). We can 
use, potentially, one per process. These descriptors, as the name suggests, are private to each process, and 
describe the memory segments of that process. In addition, we have "gates." We need to use only one to 
call the system. 

Descriptors come in many different flavors (see Listing Seven): those that refer to memory or system
data structures directly, and gates that indirectly refer to other memory segments. We use task gates to
generically switch to the next consecutive task state, and call gates to allow us to enter the kernel’s global
code segment in a system call. Gates get their name from the controlled fashion in which they regulate ring
crossings, again from the MULTICS heritage.

Actually our coverage of descriptors is not yet complete. We have hidden descriptors as well that serve 
special functions. Interrupts and exceptions on the 386 index yet another descriptor table, the Interrupt 
Descriptor Table (IDT). No program code can call these gate descriptors. Instead, external interrupts and 
internal processor exceptions transfer through these gate descriptors. We also use a kind of "meta 
descriptor" called a "region descriptor," which is used to describe descriptor tables so that we can load 
them via appropriate instructions. So much for the cast of players in this descriptor drama. 

Because the actual descriptor encoding is somewhat obscure (it was meant to be backward-compatible with
the 286), we chose to refer to the descriptor by having a subroutine shuffle our software descriptors into
appropriate form when presented to the hardware for use. In Listing Six, local and global tables are filled
out by translating them into hardware form and loading them with the lgdt() and lidt() functions. We do
this, even though we are already in protected mode, to provide the newer version of the descriptor tables
that we wish to use. The function lgdt() hides some characteristics of the 386 segmentation from view,
because when we reload the GDT (we are running using active GDT descriptors), we need to flush instruction
prefetch and reload all kernel descriptors. This ensures proper code execution. We then reload the CS
register by turning the normal intrasegment return into an intersegment return.
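
The "shuffle" from a software descriptor into hardware form can be sketched as below. This is a minimal
illustration, not the routine used in Listing Six: struct soft_seg, its field names, and seg_to_hw() are
hypothetical. The byte layout is that of a 386 segment descriptor, with the base and limit scattered
across the eight bytes for 286 compatibility.

```c
#include <assert.h>
#include <stdint.h>

/* Software-friendly description of a segment (hypothetical layout). */
struct soft_seg {
    uint32_t base;      /* linear base address */
    uint32_t limit;     /* 20-bit limit */
    uint8_t  access;    /* present bit, privilege level, and type */
    uint8_t  flags;     /* low nibble: granularity, default size, ... */
};

/* Shuffle the fields into the 8-byte layout the 386 expects. */
void seg_to_hw(const struct soft_seg *s, uint8_t hw[8])
{
    assert(s->limit <= 0xfffff);        /* limit is only 20 bits wide */

    hw[0] = s->limit & 0xff;            /* limit bits 7..0   */
    hw[1] = (s->limit >> 8) & 0xff;     /* limit bits 15..8  */
    hw[2] = s->base & 0xff;             /* base bits 7..0    */
    hw[3] = (s->base >> 8) & 0xff;      /* base bits 15..8   */
    hw[4] = (s->base >> 16) & 0xff;     /* base bits 23..16  */
    hw[5] = s->access;                  /* access rights byte */
    hw[6] = (uint8_t)(((s->flags & 0x0f) << 4) |
                      ((s->limit >> 16) & 0x0f));   /* flags + limit 19..16 */
    hw[7] = (s->base >> 24) & 0xff;     /* base bits 31..24  */
}
```

For a "flat" kernel code segment (base 0, limit 0xfffff, access 0x9a, flags 0xc) this produces the bytes
ff ff 00 00 00 9a cf 00. Keeping the software form separate means the rest of the code never has to
deal with the split fields.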

In the case of our IDT table, we use a subroutine, setgate() (see Listing Eight), to build interrupt gate
descriptors that will enter the system’s global code descriptor at special assembler stub routines. Each is
referred to by a special naming convention hidden by the IDTVEC() macro that catches the exception or
interrupt. With all of these descriptor tables loaded and in place, the 386 now has complete information 
describing the legitimate references to RAM memory by user programs, the operating system kernel, and 
hardware-accessed data structures. Exceptions, including incorrect references to memory, will also be 
caught and directed to appropriate code. 
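
A gate descriptor has a different layout from a memory segment descriptor: it holds a selector and an entry
offset rather than a base and limit. The sketch below shows one way to pack it. It is not the setgate() of
Listing Eight; the name make_gate() and its parameter list are assumptions made for illustration. The byte
layout follows the 386 gate format.

```c
#include <assert.h>
#include <stdint.h>

#define GATE_INTR386 0x0e   /* 386 interrupt gate type */
#define GATE_TRAP386 0x0f   /* 386 trap gate type */

/* Pack an 8-byte gate descriptor.
 *   offset:   entry point within the target code segment
 *   selector: code segment selector the gate enters through
 *   type:     gate type (one of the GATE_* values above)
 *   dpl:      least privileged ring allowed to use the gate (0..3) */
void make_gate(uint8_t hw[8], uint32_t offset, uint16_t selector,
               uint8_t type, uint8_t dpl)
{
    assert(dpl <= 3 && type <= 0x0f);

    hw[0] = offset & 0xff;                  /* offset bits 7..0   */
    hw[1] = (offset >> 8) & 0xff;           /* offset bits 15..8  */
    hw[2] = selector & 0xff;                /* selector, low byte  */
    hw[3] = (selector >> 8) & 0xff;         /* selector, high byte */
    hw[4] = 0;                              /* unused by interrupt gates */
    hw[5] = (uint8_t)(0x80 | (dpl << 5) | type);    /* present, DPL, type */
    hw[6] = (offset >> 16) & 0xff;          /* offset bits 23..16 */
    hw[7] = (offset >> 24) & 0xff;          /* offset bits 31..24 */
}
```

An IDT is then simply an array of these eight-byte entries, one per exception or interrupt vector, each
pointing at its assembler stub.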






One virtue of this complicated scheme of descriptors and segments is that it is possible to add new 
microprocessor features by simply adding new descriptor types. The mechanism is now general enough to 
support a wide variety of data objects in a consistent way. 

Initial Task State Load 

Even if we don’t wish to use the 386’s special context switch feature, we must initialize a root task state. 
Why? Because once we are in a user-mode process, only the TSS (Task State Segment) contains the 
information on where the stack is in the supervisor (kernel) processor ring. 

Interestingly, the processor will indeed go into user mode, functioning just fine until a trap, interrupt, or 
system call occurs. Most other processors have dedicated register sets to locate the kernel stack in these 
cases. But the 386 designers conceptualized ring crossings (user <-> kernel mode) like that of task 
switches. Thus, we include the supervisor "entry into ring" stack pointer in the TSS. 

In the TSS structure (see Listing Ten), we assign a kernel/supervisor stack top well below our current
stack to avoid conflicts. We select this as the current task segment, and then use a little trick to fill out the 
remainder of this large structure by arranging to context switch back to our task segment, using our 
assembly language stub jmptss. jmptss always saves the task state of the current task and then loads the 
state yet again. Because the new state must not already be BUSY in order to use this trick, we force it to 
be AVAILABLE. UNIX kernels use a function, called resume(), to provide for this mechanism. However, 
jmptss also provides for context switching when we wish to transfer from one process or task to the next 
one. The general case, when we call jmptss with a new TSS selector, not the current one, will be covered 
in a later article. 

Trap Handling 

Earlier in this article we alluded to the 386 exception/trap’s assembler stubs routines. Now that we have 
enough 386 support in place, we can describe trap handling and the mechanism by which these stubs 
reflect the trap event into the C trap() handling function. Listing Twelve contains code for a sample trap, 
in this case a breakpoint or INT3 instruction. Assembly language stubs in module srt.s are executed by the 
processor in response to receiving a trap or interrupt that selects the corresponding IDT entry. 

These stubs are the minimal glue that index each kind of trap with a manifest constant. This constant is
always of the form T_XXXX and is obtained from the file trap.h (see Listing Fourteen). Some traps on
the 386 also place an error code word on the stack, in order to transfer additional information about the 
cause of the trap. Because we need to ultimately remove this error code before returning from the trap, we 
first make all traps appear alike by pushing a dummy error code of zero on the stack for traps without error 
codes. Then, the common code that returns after the trap has only to remove both the trap constant and 
error code, regardless of which trap occurred. 

After saving the processor state, all trap stubs call common code, which calls the C language trap handler; 
they also have code following, which restores the state and returns to whatever code was running when the 
trap occurred. The C language handler merely notifies us of the processor state and exception type and 
then returns. Since our test function will test different traps in a sequence, we prefer to bypass faults that 
don’t move the program counter. We do this by manually incrementing the program counter, knowing that 
all the faulting instructions we intend to encounter happen to be 1 byte in size. Obviously, this convenience doesn’t hold for
the BSD kernel, but it is satisfactory for the moment. 







Interrupt Handling 


Interrupts on the 386 are a kind of trap that function much like the exceptions we discussed. In Listing
Twelve, the AT’s interrupt control units and interrupt timer are initialized to allow hardware interrupts to
be signaled from AT devices, such as the timer, to the processor. As a minor point, we clear the coprocessor’s
exception interrupt to avoid spotting a possibly spurious interrupt from some preexisting condition formed
in MS-DOS mode prior to running the standalone program.


Next, our interrupt test enables the processor and interrupt control unit for a brief period of time, allowing
hardware interrupts to be processed by the 386. Any interrupts occurring during that interval will cause the
386 to extract the appropriate IDT entry from the table and cause one of the assembly stubs in Listing
Twelve to be executed. These stubs, like the trap stubs, save the state, record the interrupt index on the
stack, and call the common C function intr().


In intr(), the present interrupt is masked off and the interrupts are then reenabled so other interrupts can be 
active while the received interrupt is processed. Note that both the stubs and this function are fully 
reentrant, as this is not an uncommon occurrence. The example code also provides some trivial interrupt 
actions for our timer, keyboard, and any other device that generates an interrupt. At this point, we dispatch 
to a specific device driver’s interrupt routine. 

After responding to the interrupt, we restore our old mask in an uninterruptable "critical section," and 
signal the interrupt control unit that this interrupt is to be acknowledged as "finished." Our interrupt stub 
then unwinds the stack, restores the state as needed, and returns us to the exact state we were in prior to 
processing the interrupt signaled to the 386. 
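
The order of operations in the last two paragraphs can be sketched in C with the hardware replaced by plain
variables. This is a simulation only, not the intr() of the listings: irq_mask stands in for the interrupt
control unit's mask register, eoi_count for the "finished" signal, and all names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NIRQ 8

typedef void (*irq_handler)(void);

static uint8_t  irq_mask;   /* stand-in for the control unit's mask register */
static unsigned eoi_count;  /* counts "finished" signals sent */
static uint8_t  mask_seen;  /* mask observed from inside the handler */
static unsigned ticks;      /* trivial timer action: count interrupts */

static void timer_intr(void)
{
    mask_seen = irq_mask;
    ticks++;
}

/* Per-device interrupt routines; only the timer is filled in here. */
static irq_handler handlers[NIRQ] = { timer_intr };

void intr_sketch(int irq)
{
    uint8_t old_mask;

    assert(irq >= 0 && irq < NIRQ);

    old_mask = irq_mask;
    irq_mask |= (uint8_t)(1u << irq);   /* mask off the interrupt in service */
    /* A real handler would re-enable processor interrupts here. */

    if (handlers[irq] != 0)
        handlers[irq]();                /* dispatch to the device's routine */

    /* A real handler would disable interrupts here: critical section. */
    irq_mask = old_mask;                /* restore the previous mask */
    eoi_count++;                        /* stands in for the "finished" signal */
}
```

Because the mask is saved on entry and restored on exit, a nested call for a different interrupt leaves the
outer call's state intact, which is what makes the scheme reentrant.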


System Call Handling 

To test system calls, we must first enter user mode by generating an outbound ring crossing. The touser() 
function (see Listing Thirteen) does this by switching to our previously set-up user ring address space
found in the LDT and enters user mode in the function named usercode(). (The LDT, by a remarkable 
coincidence, exactly corresponds to our standalone system’s "kernel" address space!) A special kind of 
stack frame is built that imitates the 386’s inbound ring-crossing processing. In other words, we "fake out" 
the processor into believing it has just come from user mode. This done, we calmly return with an 
intersegment return, executing in the new mode at the beginning of the function. 


usercode() does not tarry in the user ring for long; it immediately calls the system call gate previously set
up in the LDT. This call gate regulates entry into the kernel at location IDTVEC(syscall), which in turn
calls the C function syscall() to properly enter the kernel.


Normally, at the end of the system call assembly stub we would return to the user ring program, but since
we have concluded testing the user ring transition, we instead return to the touser() function caller via a
nonlocal goto. We have carefully preserved the stack frame further up the stack, so we can test other parts 
of the kernel mechanism in the test386() function. 







Page Fault Handling 


Our final mechanism demonstrated here involves generating a page fault, a rather common occurrence in 
our BSD UNIX system. These faults, caught by the mechanism described earlier, are trap type 
T_PAGEFLT, and they end up in the trap handler trap(). In Listing Eleven , notice that this function prints 
the contents of the 386 special processor register cr2 on a trap. This register records the address value 
causing the page fault trap. Eventually, the virtual memory system will require this in order to determine 
which page is being requested by the program being run and if this page should be made accessible. In this 
case, it obviously is not accessible. 

To generate a page fault, code in module i386.c (see Listing Fifteen) first reads and then writes an address
outside of the range of valid page table and directory structures. If this address had been outside of the 
range of the current segment descriptor, it would have generated a general-protection fault before being 
flagged by the paging unit. (In the scheme of things, segmentation is ahead of paging.) But the address 
invoked is well within the range of the segment descriptor, and only the paging unit takes issue with it. 

Our page table mapping validates the first page of page tables (the first 4 Mbytes of address space) even 
though all the other page table pages are not present. However, because the page table directory entry in 
this case is actually invalid, the 386 MMU balks on address 0x800000 and signals trap type T_PAGEFLT
to trap(). 
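
To see why address 0x800000 faults, it helps to split the address the way the MMU does. The following is a
small sketch; the helper names pde_index(), pte_index(), and pde_present() are our own, and only the
directory lookup is modeled.

```c
#include <assert.h>
#include <stdint.h>

#define PG_V 0x001u     /* "present" bit in a directory or table entry */

/* The top ten bits of a linear address select the page directory entry. */
unsigned pde_index(uint32_t va) { return va >> 22; }

/* The next ten bits select the entry within that page table. */
unsigned pte_index(uint32_t va) { return (va >> 12) & 0x3ffu; }

/* Nonzero if the directory entry covering va is marked present. */
int pde_present(const uint32_t *pdir, uint32_t va)
{
    assert(pdir != 0);
    return (pdir[pde_index(va)] & PG_V) != 0;
}
```

Address 0x800000 has directory index 2, while our mapping validates only directory entry 0 (addresses
below 4 Mbytes). The lookup therefore stops at the directory, before any page table is consulted.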

We can determine the type of page fault from the error code of the trap, which in turn tells us whether a 
"read," "write," or "protection violation" condition occurred. With the 386, we can also restrict pages to 
"supervisor only" access, thus keeping them out of the hands of any nosy user programs. It is interesting to 
note that while user programs can write-protect pages of memory (typically when used for instruction 
segments), the kernel (running in the supervisor ring) does not have the same option since the 386 ignores 
the "write protect" control on paging. While this is not needed for UNIX to function, we would like to 
make parts of the kernel "read-only" to catch unintended modifications by undiscovered bugs in the 
kernel. We just can’t do this on the 386. 
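
Decoding the page fault error code is a matter of testing three bits. The sketch below assumes the 386
error code layout (bit 0: protection violation rather than page not present; bit 1: write rather than
read; bit 2: fault came from the user ring); the names PF_PROT, PF_WRITE, PF_USER, struct fault_info, and
decode_pf() are hypothetical.

```c
#include <assert.h>

#define PF_PROT  0x1u   /* set: protection violation; clear: page not present */
#define PF_WRITE 0x2u   /* set: write access; clear: read access */
#define PF_USER  0x4u   /* set: fault occurred in the user ring */

struct fault_info {
    int protection;     /* 1 = protection violation, 0 = not present */
    int write;          /* 1 = write, 0 = read */
    int user;           /* 1 = user ring, 0 = supervisor ring */
};

/* Break a page fault error code into its three conditions. */
struct fault_info decode_pf(unsigned code)
{
    struct fault_info f;

    assert((code & ~(PF_PROT | PF_WRITE | PF_USER)) == 0);  /* only 3 bits defined */
    f.protection = (code & PF_PROT) != 0;
    f.write      = (code & PF_WRITE) != 0;
    f.user       = (code & PF_USER) != 0;
    return f;
}
```

In our test, which runs in the supervisor ring and touches an unmapped address, we would expect the read to
report all three fields clear and the write to report only the write field set.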

Where Do We Go From Here? 

In the first examination of our initial utilities, we discussed several items of importance in our standalone 
system /sys/stand, some of its utilities, and a library of support routines. Through the standalone system, 
we were able to use GCC programs to access devices, such as the keyboard and display, as well as UNIX 
file structures on the hard disk. It also provided us with a platform to examine the 386’s requirements 
through extensions which supported features incorporated into our UNIX port, and could also be used as a 
test bed for some of these functions. As we stated earlier, the standalone system can be viewed at this 
stage as if it were the kernel itself, with the extensions the basis of our prototype kernel code. We have 
started up the base of the mountain. 

Next time, we will proceed further with our initial utilities development, by creating a stable cross-tools 
environment. Kermit and NCSA telnet will be used to load files and programs over Ethernet and serial
lines. We will then focus on proving GCC itself valid for cross-support purposes, as well as the limitations 
and alternatives. 






Brief Notes: 386 Rings 

A "ring" is a concept developed in the early days of large-scale timesharing by those working on the 
MULTICS operating system (see The Multics System: An Examination of Its Structure, Elliott I.
Organick, MIT Press: Cambridge Mass., 1972). These rings establish a hierarchy of memory protection
and processor function, in which code running in a lower-valued ring has access to all higher-valued
or equally valued rings. A protection violation occurs when less "secure" code (running in a higher-valued
ring) accesses more "secure" code or data (at a lower-valued ring).

Rings can be used to regulate access and determine if a protection violation occurs. 
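
The basic rule can be written as a one-line comparison. This is a deliberately simplified sketch: it ignores
the finer points of the 386's privilege checks, and the name ring_may_access() is hypothetical.

```c
#include <assert.h>

/* Ring numbers run from 0 (most trusted) to 3 (least trusted).
 * Code running in ring cpl may reach an object guarded at ring dpl
 * only when cpl is numerically no greater than dpl. */
int ring_may_access(int cpl, int dpl)
{
    assert(cpl >= 0 && cpl <= 3 && dpl >= 0 && dpl <= 3);
    return cpl <= dpl;
}
```

With this rule, the kernel in ring 0 can touch user data in ring 3, while a user program in ring 3 that
touches ring 0 data triggers a protection violation.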

The 386 supports a multiple-ring protection model with four distinct rings of protection (0-3).
Traditional UNIX supports only two rings: One for the kernel operating system, and one for the current 
user process. On the 386, this corresponds to the supervisor ring (0) and the user ring (3). When the 386 is 
running in the user ring and receives an interrupt or exception, or it needs to process a system call, it must 
switch rings to the supervisor ring. Once there, the kernel is run as a trusted program to handle these 
events. This switching of rings, or "ring crossing," is central to the UNIX memory protection model. 
Unlike MS-DOS, where operating system code and user application code mingle in the same address 
space, UNIX programs run in "hard" shells of address space. UNIX programs are not able to modify each 
other or the operating system by virtue of memory protection. This is enforced by memory protection 
hardware on the microprocessor in the general case when the applications program is running. Ring 
crossing, where we go from user to kernel code, needs to be carefully done to preserve the protection 
model in all cases. In this way, nothing that a user program does can possibly affect another user program 
adversely or "crash" the operating system. 

—B.J., L.J. 

[LISTING ONE] 

# hi.s: Simplest protected mode program providing some kind of output.

        .text
start:  movl    $0x0e690e48, 0x0b8800   # put "hi" mid screen on display
        hlt

[LISTING TWO] 

# hello.s: Minimal test of GNU GAS assembler, handles CGA & strings.

        .text
start:
        movl    $0xA0000,%esp
        pushl   $str
        call    _puts
        pop     %eax
        hlt

str:    .asciz  "\n\rHello world from GAS\r\n"

_puts:
        push    %ebx
        movl    8(%esp),%ebx
1:      cmpb    $0,(%ebx)               # until we see a null
        je      2f
        movzbl  (%ebx),%eax
        pushl   %eax
        call    _putchar                # put out characters
        popl    %eax
        incl    %ebx
        jmp     1b
2:      popl    %ebx
        ret

crtat:  .long   0xb8000                 # address of CGA video RAM
row:    .long   0

_putchar:
        movzbl  4(%esp),%eax
        push    %ebx
        push    %ecx
        movl    crtat,%ebx
        cmpl    $0xb8000+80*25*2,%ebx   # continuous output off screen edge & bot
        jl      1f
        movl    $0,row
        movl    $0xb8000+80*(25-1)*2,%ebx
1:      cmpb    $0xd,%al                # cr
        jne     1f
        movl    $80,%ecx                # clear rest of line
        subl    row,%ecx
        movl    %ebx,%edi
        movw    $0xfff,%ax
        cld
        rep
        stosw
        subl    row,%ebx
        subl    row,%ebx
        movl    $0,row
        jmp     9f
1:      cmpb    $0xa,%al                # nl
        jne     2f
        cmpl    $0xb8000+80*(25-1)*2,%ebx       # scroll?
        jl      1f
        movl    $0xb8000,%edi           # scroll page
        movl    $0xb8000+80*2,%esi
        movl    $80*(25-1),%ecx
        cld
        rep
        movsw
        movl    $80,%ecx                # clear new bottom line
        movl    $0xb8000+80*(25-1)*2,%edi
        movw    $0,%ax
        rep
        stosw
        sub     $80*2,%ebx              # position cursor before lf
1:      add     $80*2,%ebx
        jmp     9f
2:      orw     $0x0e00,%ax             # attribute
        movw    %ax,(%ebx)
        addl    $2,%ebx
        incl    row
9:      movl    %ebx,crtat
        pop     %ecx
        pop     %ebx
        ret


[LISTING THREE] 

/* [Excerpted from srt.s] */

entry:  .globl  entry
        jmp     1f
        .space  0x500           /* skip over BIOS data area */
1:      cli                     /* no interrupts yet */
        movl    $0xA0000,%esp
        movl    %esp,%edx
        movl    $_edata,%eax
        subl    %eax,%edx       /* clear stack and heap store */
        pushl   %edx
        pushl   %eax
        call    _bzero
        popl    %eax
        popl    %eax
        call    _main

/* hello.c */

main() { printf("Hello, world!\n"); }

[LISTING FOUR] 


/* kbd.c: Copyright (c) 1989, 1990 William Jolitz. All rights reserved.
 * Written by William Jolitz 9/89
 * Redistribution and use in source and binary forms are freely permitted
 * provided that the above copyright notice and attribution and date of work
 * and this paragraph are duplicated in all such forms.
 * THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
 * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
 * Standalone driver for IBM PC keyboards.
 */

#define L       0x001   /* locking function */
#define SHF     0x002   /* keyboard shift */
#define ALT     0x004   /* alternate shift -- alternate chars */
#define NUM     0x008   /* numeric shift  cursors vs. numeric */
#define CTL     0x010   /* control shift -- allows ctl function */
#define CPS     0x020   /* caps shift -- swaps case of letter */
#define ASCII   0x040   /* ascii code for this key */
#define STP     0x080   /* stop output */
#define BREAK   0x100   /* key breaking contact */

typedef unsigned char u_char;

u_char inb();












u_char action[] = {
0,     ASCII, ASCII, ASCII, ASCII, ASCII, ASCII, ASCII,   /* scan  0- 7 */
ASCII, ASCII, ASCII, ASCII, ASCII, ASCII, ASCII, ASCII,   /* scan  8-15 */
ASCII, ASCII, ASCII, ASCII, ASCII, ASCII, ASCII, ASCII,   /* scan 16-23 */
ASCII, ASCII, ASCII, ASCII, ASCII, CTL,   ASCII, ASCII,   /* scan 24-31 */
ASCII, ASCII, ASCII, ASCII, ASCII, ASCII, ASCII, ASCII,   /* scan 32-39 */
ASCII, ASCII, SHF,   ASCII, ASCII, ASCII, ASCII, ASCII,   /* scan 40-47 */
ASCII, ASCII, ASCII, ASCII, ASCII, ASCII, SHF,   ASCII,   /* scan 48-55 */
ALT,   ASCII, CPS|L, 0,     0,     ASCII, 0,     0,       /* scan 56-63 */
0,     0,     0,     0,     0,     NUM|L, STP|L, ASCII,   /* scan 64-71 */
ASCII, ASCII, ASCII, ASCII, ASCII, ASCII, ASCII, ASCII,   /* scan 72-79 */
ASCII, ASCII, ASCII, ASCII, 0,     0,     0,     0,       /* scan 80-87 */
0,     0,     0,     0,     0,     0,     0,     0,
0,     0,     0,     0,     0,     0,     0,     0,     0,0,0,0,0,0,0,0,
0,     0,     0,     0,     0,     0,     0,     0,     0,0,0,0,0,0,0,0, } ;








u_char unshift[] = {    /* no shift */
0,     033,   '1',   '2',   '3',   '4',   '5',   '6',     /* scan  0- 7 */
'7',   '8',   '9',   '0',   '-',   '=',   0177,  '\t',    /* scan  8-15 */
'q',   'w',   'e',   'r',   't',   'y',   'u',   'i',     /* scan 16-23 */
'o',   'p',   '[',   ']',   '\r',  CTL,   'a',   's',     /* scan 24-31 */
'd',   'f',   'g',   'h',   'j',   'k',   'l',   ';',     /* scan 32-39 */
'\'',  '`',   SHF,   '\\',  'z',   'x',   'c',   'v',     /* scan 40-47 */
'b',   'n',   'm',   ',',   '.',   '/',   SHF,   '*',     /* scan 48-55 */
ALT,   ' ',   CPS|L, 0,     0,     ' ',   0,     0,       /* scan 56-63 */
0,     0,     0,     0,     0,     NUM|L, STP|L, '7',     /* scan 64-71 */
'8',   '9',   '-',   '4',   '5',   '6',   '+',   '1',     /* scan 72-79 */
'2',   '3',   '0',   '.',   0,     0,     0,     0,       /* scan 80-87 */
0,     0,     0,     0,     0,     0,     0,     0,
0,     0,     0,     0,     0,     0,     0,     0,     0,0,0,0,0,0,0,0,
0,     0,     0,     0,     0,     0,     0,     0,     0,0,0,0,0,0,0,0, } ;








u_char shift[] = { /* shift shift */
0,     033,   '!',   '@',   '#',   '$',   '%',   '^',   /* scan  0- 7 */
'&',   '*',   '(',   ')',   '_',   '+',   0177,  '\t',  /* scan  8-15 */
'Q',   'W',   'E',   'R',   'T',   'Y',   'U',   'I',   /* scan 16-23 */
'O',   'P',   '[',   ']',   '\r',  CTL,   'A',   'S',   /* scan 24-31 */
'D',   'F',   'G',   'H',   'J',   'K',   'L',   ':',   /* scan 32-39 */
'"',   '~',   SHF,   '|',   'Z',   'X',   'C',   'V',   /* scan 40-47 */
'B',   'N',   'M',   '<',   '>',   '?',   SHF,   '*',   /* scan 48-55 */
ALT,   ' ',   CPS|L, 0,     0,     ' ',   0,     0,     /* scan 56-63 */
0,     0,     0,     0,     0,     NUM|L, STP|L, '7',   /* scan 64-71 */
'8',   '9',   '-',   '4',   '5',   '6',   '+',   '1',   /* scan 72-79 */
'2',   '3',   '0',   '.',   0,     0,     0,     0,     /* scan 80-87 */
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
};









u_char ctl[] = { /* CTL shift */
0,     033,   '!',   000,   '#',   '$',   '%',   036,   /* scan  0- 7 */
'&',   '^',   '(',   ')',   037,   '+',   034,   '\177', /* scan  8-15 */
021,   027,   005,   022,   024,   031,   025,   011,   /* scan 16-23 */
017,   020,   033,   035,   '\r',  CTL,   001,   013,   /* scan 24-31 */
004,   006,   007,   010,   012,   013,   014,   ';',   /* scan 32-39 */
'\'',  '`',   SHF,   034,   032,   030,   003,   026,   /* scan 40-47 */
002,   016,   015,   '<',   '>',   '?',   SHF,   '*',   /* scan 48-55 */
ALT,   ' ',   CPS|L, 0,     0,     ' ',   0,     0,     /* scan 56-63 */
CPS|L, 0,     0,     0,     0,     0,     0,     0,     /* scan 64-71 */
0,     0,     0,     0,     0,     0,     0,     0,     /* scan 72-79 */
0,     0,     0,     0,     0,     0,     0,     0,     /* scan 80-87 */
0,     0,     033,   '7',   '4',   '1',   0,     NUM|L, /* scan 88-95 */
'8',   '5',   '2',   0,     STP|L, '9',   '6',   '3',   /* scan 96-103 */
'.',   0,     '*',   '-',   '+',   0,     0,     0,     /* scan 104-111 */
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
};










#define KBSTATP 0x64 /* kbd status port */ 

#define KBS_RDY 0x02 /* kbd char ready */
#define KBDATAP 0x60 /* kbd data port */
#define KBSTATUSPORT 0x61 /* kbd status */
#define KBD_BRK 0x80 /* key is breaking contact, not making contact */
#define KBD_KEY(s) ((s) & 0x7f) /* key that has changed */

/* Return an ASCII character from the keyboard. */ 
u_char kbd() {
    u_char dt;
    int act;    /* int rather than u_char: BREAK (0x100) needs more than 8 bits */
    static u_char odt, shfts, ctls, alts, caps, num, stp;

    do {
        do {
            while (inb(KBSTATP)&KBS_RDY) ;
            dt = inb(KBDATAP);
        } while (dt == odt);
        odt = dt;
        dt = KBD_KEY(dt);
        act = action[dt];
        if (odt & KBD_BRK) act |= BREAK;

        /* kinds of shift keys */
        if (act&SHF) actl(act, &shfts);
        if (act&ALT) actl(act, &alts);
        if (act&NUM) actl(act, &num);
        if (act&CTL) actl(act, &ctls);
        if (act&CPS) actl(act, &caps);
        if (act&STP) actl(act, &stp);

        /* extra parentheses needed: == binds more tightly than & */
        if ((act&(ASCII|BREAK)) == ASCII) {
            u_char chr;

            if (shfts)
                chr = shift[dt];
            else {
                if (ctls) chr = ctl[dt];
                else chr = unshift[dt];
            }
            if (caps && (chr >= 'a' && chr <= 'z'))
                chr -= 'a' - 'A';
            return(chr);
        }
    } while (1);
}

/* Handle shift key actions */ 
actl(act, v) u_char *v; {
    /* are we locking ... */
    if (act&L) {
        if ((act&BREAK) == 0) *v ^= 1;
    /* ... or single - action ? */
    } else
        if (act&BREAK) *v = 0; else *v = 1;
}
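
The shift-key bookkeeping in actl() distinguishes two behaviors: a locking key (Caps Lock, Num Lock) toggles its state once per press and ignores the release, while a momentary key (Shift, Ctrl) is simply down or up. The following is a minimal, hosted-C sketch of that rule, not the listing's own code; the name `shift_update` is hypothetical, and the flag values are copied from the defines at the top of the listing.

```c
#include <assert.h>

#define L      0x001   /* locking function (from the listing) */
#define SHF    0x002   /* keyboard shift */
#define CPS    0x020   /* caps shift */
#define BREAK  0x100   /* key breaking contact */

/* Update one shift-state flag from an action word.
 * Hypothetical helper mirroring the logic of actl(). */
void shift_update(int act, unsigned char *state)
{
    assert(state != 0);
    if (act & L) {
        /* locking key: flip the state on a press, ignore the release */
        if ((act & BREAK) == 0)
            *state ^= 1;
    } else {
        /* momentary key: set while held, clear on release */
        *state = (act & BREAK) ? 0 : 1;
    }
}
```

Note that the action word must be wider than eight bits for this to work, since BREAK is 0x100.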


[LISTING FIVE] 

/* cga.c: Copyright (c) 1989, 1990 William Jolitz. All rights reserved. 

* Written by William Jolitz 9/89 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS'' AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

* Standalone driver for IBM PC Displays like CGA. 

*/ 

typedef unsigned short u_short; 
typedef unsigned char u_char; 

#define CRT_TXTADDR Crtat
#define COL 80
#define ROW 25
#define CHR 2

u_short *Crtat = ((u_short *)0xb8000); /* 0xb0000 for monochrome */
u_short *crtat;
u_char color = 0xe;
int row;

sput(c) u_char c; {
    if (crtat == 0) {
        crtat = CRT_TXTADDR; bzero(crtat,COL*ROW*CHR);
    }
    if (crtat >= (CRT_TXTADDR+COL*ROW*CHR)) {
        crtat = CRT_TXTADDR+COL*(ROW-1); row = 0;
    }
    switch(c) {
    case '\t':
        do {
            *crtat++ = (color<<8)| ' '; row++;
        } while (row % 8);
        break;
    case '\010':
        crtat--; row--;
        break;
    case '\r':
        bzero(crtat,(COL-row)*CHR); crtat -= row; row = 0;
        break;
    case '\n':
        if (crtat >= CRT_TXTADDR+COL*(ROW-1)) { /* scroll */
            bcopy(CRT_TXTADDR+COL, CRT_TXTADDR,COL*(ROW-1)*CHR);
            bzero(CRT_TXTADDR+COL*(ROW-1),COL*CHR);
            crtat -= COL;
        }
        crtat += COL;
        break;
    default:
        *crtat++ = (color<<8)| c; row++;
        break;
    }
}
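
Each position on the text display is one 16-bit cell, with the attribute in the high byte and the character in the low byte, which is what the expression `(color<<8)| c` builds. As a rough illustration only, the sketch below writes into an ordinary array standing in for display memory; `cga_cell` and `cga_put` are hypothetical names that do not appear in cga.c.

```c
#include <assert.h>

#define COL 80
#define ROW 25

typedef unsigned short u_short;

/* Compose one text-mode cell: attribute in the high byte, character low. */
u_short cga_cell(unsigned char attr, unsigned char ch)
{
    return (u_short)((attr << 8) | ch);
}

/* Store a character at row r, column c of a COL x ROW cell buffer.
 * The buffer is indexed in cells, so no byte scaling is applied. */
void cga_put(u_short *screen, int r, int c,
             unsigned char attr, unsigned char ch)
{
    assert(r >= 0 && r < ROW && c >= 0 && c < COL);
    screen[r * COL + c] = cga_cell(attr, ch);
}
```

Because the buffer pointer is a `u_short *`, arithmetic on it already counts cells; the CHR factor of 2 only belongs in byte counts such as those handed to bzero() and bcopy().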

[LISTING SIX] 

/* [excerpted from i386.c] */ 

/* Descriptor Tables */ 

/* Global Descriptor Table */ 

#define GNULL_SEL 0 /* Null Descriptor - obligatory */
#define GCODE_SEL 1 /* Kernel Code Descriptor */
#define GDATA_SEL 2 /* Kernel Data Descriptor */
#define GLDT_SEL 3 /* LDT - eventually one per process */
#define GTGATE_SEL 4 /* Process task switch gate */
#define GPANIC_SEL 5 /* Task state to consider panic from */
#define GPROC0_SEL 6 /* Task state process slot zero and up */

union descriptor gdt[GPROC0_SEL+NPROC];

/* interrupt descriptor table */
struct gate_descriptor idt[NEXECPT+NINTR];

/* local descriptor table */
#define LSYS5CALLS_SEL 0 /* SVID/BCS 386 system call gate */
#define LSYS5SIGR_SEL 1 /* SVID/BCS 386 sigreturn() */
#define LBSDCALLS_SEL 2 /* BSD experimental system calls */
#define LUCODE_SEL 3 /* user process code descriptor */
#define LUDATA_SEL 4 /* user process data descriptor */

union descriptor ldt[LUDATA_SEL+1];

/* Task State Structures (TSS) for hardware context switch */
struct i386tss tss[NPROC], ptss;

/* software prototypes -- in more palatable form */

struct soft_segment_descriptor gdt_segs[GPROC0_SEL+NPROC] = {
    /* Null Descriptor */
{   0x0,            /* segment base address */
    0x0,            /* length - all address space */
    0,              /* segment type */
    0,              /* segment descriptor priority level */
    0,              /* segment descriptor present */
    0, 0,
    0,              /* default 32 vs 16 bit size */
    0               /* limit granularity (byte/page units)*/ },
    /* Code Descriptor for kernel */
{   0x0,            /* segment base address */
    0xfffff,        /* length - all address space */
    SDT_MEMERA,     /* segment type */
    0,              /* segment descriptor priority level */
    1,              /* segment descriptor present */
    0, 0,
    1,              /* default 32 vs 16 bit size */
    1               /* limit granularity (byte/page units)*/ },
    /* Data Descriptor for kernel */
{   0x0,            /* segment base address */
    0xfffff,        /* length - all address space */
    SDT_MEMRWA,     /* segment type */
    0,              /* segment descriptor priority level */
    1,              /* segment descriptor present */
    0, 0,
    1,              /* default 32 vs 16 bit size */
    1               /* limit granularity (byte/page units)*/ },
    /* LDT Descriptor */
{   (int) ldt,      /* segment base address */
    sizeof(ldt)-1,  /* length - all address space */
    SDT_SYSLDT,     /* segment type */
    0,              /* segment descriptor priority level */
    1,              /* segment descriptor present */
    0, 0,
    0,              /* unused - default 32 vs 16 bit size */
    0               /* limit granularity (byte/page units)*/ },
    /* Null Descriptor - Placeholder */
{   0x0,            /* segment base address */
    0x0,            /* length - all address space */
    0,              /* segment type */
    0,              /* segment descriptor priority level */
    0,              /* segment descriptor present */
    0, 0,
    0,              /* default 32 vs 16 bit size */
    0               /* limit granularity (byte/page units)*/ },
    /* Panic Tss Descriptor */
{   (int) &ptss,    /* segment base address */
    sizeof(tss)-1,  /* length - all address space */
    SDT_SYS386TSS,  /* segment type */
    0,              /* segment descriptor priority level */
    1,              /* segment descriptor present */
    0, 0,
    0,              /* unused - default 32 vs 16 bit size */
    0               /* limit granularity (byte/page units)*/ },
    /* Process 0 Tss Descriptor */
{   (int) &tss[0],  /* segment base address */
    sizeof(tss)-1,  /* length - all address space */
    SDT_SYS386TSS,  /* segment type */
    0,              /* segment descriptor priority level */
    1,              /* segment descriptor present */
    0, 0,
    0,              /* unused - default 32 vs 16 bit size */
    0               /* limit granularity (byte/page units)*/ } };



struct soft_segment_descriptor ldt_segs[] = {
    /* Null Descriptor - overwritten by call gate */
{   0x0,            /* segment base address */
    0x0,            /* length - all address space */
    0,              /* segment type */
    0,              /* segment descriptor priority level */
    0,              /* segment descriptor present */
    0, 0,
    0,              /* default 32 vs 16 bit size */
    0               /* limit granularity (byte/page units)*/ },
    /* Null Descriptor - overwritten by call gate */
{   0x0,            /* segment base address */
    0x0,            /* length - all address space */
    0,              /* segment type */
    0,              /* segment descriptor priority level */
    0,              /* segment descriptor present */
    0, 0,
    0,              /* default 32 vs 16 bit size */
    0               /* limit granularity (byte/page units)*/ },
    /* Null Descriptor - overwritten by call gate */
{   0x0,            /* segment base address */
    0x0,            /* length - all address space */
    0,              /* segment type */
    0,              /* segment descriptor priority level */
    0,              /* segment descriptor present */
    0, 0,
    0,              /* default 32 vs 16 bit size */
    0               /* limit granularity (byte/page units)*/ },
    /* Code Descriptor for user */
{   0x0,            /* segment base address */
    0xfffff,        /* length - all address space */
    SDT_MEMERA,     /* segment type */
    SEL_UPL,        /* segment descriptor priority level */
    1,              /* segment descriptor present */
    0, 0,
    1,              /* default 32 vs 16 bit size */
    1               /* limit granularity (byte/page units)*/ },
    /* Data Descriptor for user */
{   0x0,            /* segment base address */
    0xfffff,        /* length - all address space */
    SDT_MEMRWA,     /* segment type */
    SEL_UPL,        /* segment descriptor priority level */
    1,              /* segment descriptor present */
    0, 0,
    1,              /* default 32 vs 16 bit size */
    1               /* limit granularity (byte/page units)*/ } };


extern ssdtosd(), lgdt(), lidt(), lldt(), usercode(), touser();


init386 () { 

/* make gdt memory segments */ 

for (x=0; x < sizeof gdt / sizeof gdt[0] ; x++)
    ssdtosd(gdt_segs+x, gdt+x);
printf("lgdt\n"); getchar();
lgdt(gdt, sizeof(gdt)-1);

/* make ldt memory segments */ 

for (x=0; x < sizeof ldt / sizeof ldt[0] ; x++) 
ssdtosd(ldt_segs+x, ldt+x); 

/* make a call gate to reenter kernel with */ 

setgate(&ldt[LSYS5CALLS_SEL].gd, &IDTVEC(syscall), SDT_SYS386CGT, 
SEL_UPL); 

printf("lldt\n"); getchar(); 
lldt(GSEL(GLDT_SEL, SEL_KPL)); 

/* [excerpted from srt.s] */ 

/* lgdt(*gdt, ngdt) */ 

.globl _lgdt 
gdesc: .word 0 

.long 0 
_lgdt: 

movl 4(%esp),%eax 
movl %eax,gdesc+2 
movl 8(%esp),%eax 
movw %ax,gdesc 
lgdt gdesc 

jmp 1f /* flush instruction prefetch q */
nop
1: movw $0x10,%ax /* reload other "well known" descriptors */

movw %ax,%ds 
movw %ax,%es 
movw %ax,%ss 
movl 0(%esp),%eax 
pushl %eax 

movl $8,4(%esp) /* including the ever popular CS */ 

lret 

/* lldt(sel) */ 

.globl _lldt 
_lldt: 

lldt 4(%esp) 
ret 
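
The constants `$0x10` and `$8` loaded by `_lgdt`, and the `0x7` passed to LCALL in Listing Thirteen, are selectors: a table index shifted left by three, a table-indicator bit (4 for the LDT), and a two-bit privilege level. The sketch below restates the GSEL/LSEL/IDXSEL arithmetic of segments.h as plain functions so the values can be checked by hand; the lowercase function names are hypothetical.

```c
#include <assert.h>

#define SEL_KPL 0   /* kernel priority level */
#define SEL_UPL 3   /* user priority level */
#define SEL_LDT 4   /* table indicator: local descriptor table */

/* Build a selector into the global descriptor table. */
unsigned gsel(unsigned index, unsigned rpl)
{
    assert(rpl <= 3);
    return (index << 3) | rpl;
}

/* Build a selector into the local descriptor table. */
unsigned lsel(unsigned index, unsigned rpl)
{
    assert(rpl <= 3);
    return (index << 3) | SEL_LDT | rpl;
}

/* Recover the table index from a selector. */
unsigned sel_index(unsigned sel)
{
    return (sel >> 3) & 0x1fff;
}
```

With the listing's numbering, kernel code (GDT slot 1) is selector 0x08, kernel data (slot 2) is 0x10, and the system call gate (LDT slot 0 at user privilege) is 0x07.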


[LISTING SEVEN] 

/* segments.h: Copyright (c) 1989, 1990 William Jolitz. All rights reserved. 

* Written by William Jolitz 6/20/1989 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

* 386 Segmentation Data Structures and definitions 
*/ 


/* Selectors */
#define ISPL(s) ((s)&3) /* what is the priority level of a selector */
#define SEL_KPL 0 /* kernel priority level */
#define SEL_UPL 3 /* user priority level */
#define ISLDT(s) ((s)&SEL_LDT) /* is it local or global */
#define SEL_LDT 4 /* local descriptor table */
#define IDXSEL(s) (((s)>>3) & 0x1fff) /* index of selector */
#define LSEL(s,r) (((s)<<3) | SEL_LDT | r) /* a local selector */
#define GSEL(s,r) (((s)<<3) | r) /* a global selector */

/* Memory and System segment descriptors */ 

struct segment_descriptor {
    unsigned sd_lolimit:16 ; /* segment extent (lsb) */
    unsigned sd_lobase:24 ; /* segment base address (lsb) */
    unsigned sd_type:5 ; /* segment type */
    unsigned sd_dpl:2 ; /* segment descriptor priority level */
    unsigned sd_p:1 ; /* segment descriptor present */
    unsigned sd_hilimit:4 ; /* segment extent (msb) */
    unsigned sd_xx:2 ; /* unused */
    unsigned sd_def32:1 ; /* default 32 vs 16 bit size */
    unsigned sd_gran:1 ; /* limit granularity (byte/page units)*/
    unsigned sd_hibase:8 ; /* segment base address (msb) */
} ;

/* Gate descriptors (e.g. indirect descriptors) */ 
struct gate_descriptor {
    unsigned gd_looffset:16 ; /* gate offset (lsb) */
    unsigned gd_selector:16 ; /* gate segment selector */
    unsigned gd_stkcpy:5 ; /* number of stack wds to cpy */
    unsigned gd_xx:3 ; /* unused */
    unsigned gd_type:5 ; /* segment type */
    unsigned gd_dpl:2 ; /* segment descriptor priority level */
    unsigned gd_p:1 ; /* segment descriptor present */
    unsigned gd_hioffset:16 ; /* gate offset (msb) */
} ;

/* Generic descriptor */ 
union descriptor { 

struct segment_descriptor sd; 
struct gate_descriptor gd; 

}; 

#define d_type gd.gd_type

/* system segments and gate types */
#define SDT_SYSNULL 0 /* system null */
#define SDT_SYS286TSS 1 /* system 286 TSS available */
#define SDT_SYSLDT 2 /* system local descriptor table */
#define SDT_SYS286BSY 3 /* system 286 TSS busy */
#define SDT_SYS286CGT 4 /* system 286 call gate */
#define SDT_SYSTASKGT 5 /* system task gate */
#define SDT_SYS286IGT 6 /* system 286 interrupt gate */
#define SDT_SYS286TGT 7 /* system 286 trap gate */
#define SDT_SYSNULL2 8 /* system null again */
#define SDT_SYS386TSS 9 /* system 386 TSS available */
#define SDT_SYSNULL3 10 /* system null again */
#define SDT_SYS386BSY 11 /* system 386 TSS busy */
#define SDT_SYS386CGT 12 /* system 386 call gate */
#define SDT_SYSNULL4 13 /* system null again */
#define SDT_SYS386IGT 14 /* system 386 interrupt gate */
#define SDT_SYS386TGT 15 /* system 386 trap gate */

/* memory segment types */ 

#define SDT_MEMRO 16 /* memory read only */
#define SDT_MEMROA 17 /* memory read only accessed */
#define SDT_MEMRW 18 /* memory read write */
#define SDT_MEMRWA 19 /* memory read write accessed */
#define SDT_MEMROD 20 /* memory read only expand dwn limit */
#define SDT_MEMRODA 21 /* memory read only expand dwn limit accessed */
#define SDT_MEMRWD 22 /* memory read write expand dwn limit */
#define SDT_MEMRWDA 23 /* memory r/w expand dwn limit accessed */
#define SDT_MEME 24 /* memory execute only */
#define SDT_MEMEA 25 /* memory execute only accessed */
#define SDT_MEMER 26 /* memory execute read */
#define SDT_MEMERA 27 /* memory execute read accessed */
#define SDT_MEMEC 28 /* memory execute only conforming */
#define SDT_MEMEAC 29 /* memory execute only accessed conforming */
#define SDT_MEMERC 30 /* memory execute read conforming */
#define SDT_MEMERAC 31 /* memory execute read accessed conforming */

/* is memory segment descriptor pointer ? */
#define ISMEMSDP(s) ((s->d_type) >= SDT_MEMRO && (s->d_type) <= SDT_MEMERAC)

/* is 286 gate descriptor pointer ? */
#define IS286GDP(s) (((s->d_type) >= SDT_SYS286CGT \
    && (s->d_type) < SDT_SYS286TGT))

/* is 386 gate descriptor pointer ? */
#define IS386GDP(s) (((s->d_type) >= SDT_SYS386CGT \
    && (s->d_type) < SDT_SYS386TGT))

/* is gate descriptor pointer ? */
#define ISGDP(s) (IS286GDP(s) || IS386GDP(s))

/* is segment descriptor pointer ? */
#define ISSDP(s) (ISMEMSDP(s) || !ISGDP(s))

/* is system segment descriptor pointer ? */
#define ISSYSSDP(s) (!ISMEMSDP(s) && !ISGDP(s))

/* Software definitions are in this convenient format; translated into 
* inconvenient segment descriptors when needed to be used by 386 hardware */ 
struct soft_segment_descriptor {
    unsigned ssd_base ; /* segment base address */
    unsigned ssd_limit ; /* segment extent */
    unsigned ssd_type:5 ; /* segment type */
    unsigned ssd_dpl:2 ; /* segment descriptor priority level */
    unsigned ssd_p:1 ; /* segment descriptor present */
    unsigned ssd_xx:4 ; /* unused */
    unsigned ssd_xx1:2 ; /* unused */
    unsigned ssd_def32:1 ; /* default 32 vs 16 bit size */
    unsigned ssd_gran:1 ; /* limit granularity (byte/page units)*/
};

extern ssdtosd() ; /* to decode a ssd */ 

extern sdtossd() ; /* to encode a sd */ 





/* region descriptors, used to load gdt/idt tables before segments yet exist */ 
struct region_descriptor { 

unsigned rd_limit:16 ; /* segment extent */ 

char *rd_base; /* base address */ 

}; 

/* Segment Protection Exception code bits */ 

#define SEGEX_EXT 0x01 /* recursive or externally induced */
#define SEGEX_IDT 0x02 /* interrupt descriptor table */
#define SEGEX_TI 0x04 /* local descriptor table */
/* other bits are affected descriptor index */
#define SEGEX_IDX(s) (((s)>>3)&0x1fff)
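
The header declares ssdtosd() but the listings do not show its body. As a hedged illustration of what such a conversion involves, the sketch below packs the "convenient" fields into the 64-bit hardware layout implied by the bitfield order of `struct segment_descriptor`, assuming fields are allocated from the low bit upward as on the 386. The names `struct soft_desc` and `pack_descriptor` are hypothetical and this is not the authors' routine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct soft_segment_descriptor. */
struct soft_desc {
    uint32_t base;      /* segment base address */
    uint32_t limit;     /* segment extent, 20 bits */
    unsigned type;      /* 5-bit type, e.g. 27 = SDT_MEMERA */
    unsigned dpl;       /* descriptor priority level */
    unsigned p;         /* present */
    unsigned def32;     /* default 32 vs 16 bit size */
    unsigned gran;      /* limit granularity */
};

/* Scatter base and limit into the split hardware fields. */
uint64_t pack_descriptor(const struct soft_desc *s)
{
    uint64_t d = 0;

    assert(s->limit <= 0xfffff && s->type < 32 && s->dpl < 4);
    d |= (uint64_t)(s->limit & 0xffff);             /* sd_lolimit, bits 0-15  */
    d |= (uint64_t)(s->base & 0xffffff) << 16;      /* sd_lobase,  bits 16-39 */
    d |= (uint64_t)(s->type & 0x1f) << 40;          /* sd_type,    bits 40-44 */
    d |= (uint64_t)(s->dpl & 3) << 45;              /* sd_dpl,     bits 45-46 */
    d |= (uint64_t)(s->p & 1) << 47;                /* sd_p,       bit  47    */
    d |= (uint64_t)((s->limit >> 16) & 0xf) << 48;  /* sd_hilimit, bits 48-51 */
    d |= (uint64_t)(s->def32 & 1) << 54;            /* sd_def32,   bit  54    */
    d |= (uint64_t)(s->gran & 1) << 55;             /* sd_gran,    bit  55    */
    d |= (uint64_t)((s->base >> 24) & 0xff) << 56;  /* sd_hibase,  bits 56-63 */
    return d;
}
```

Fed the kernel code prototype from Listing Six (base 0, limit 0xfffff, type SDT_MEMERA, present, 32-bit, page granular), this yields the familiar flat-segment descriptor 0x00cf9b000000ffff.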

[LISTING EIGHT] 


/* [excerpted from i386.c] */ 

/* Assemble a gate descriptor */ 

setgate(gp, func, typ, dpl) char *func; struct gate_descriptor *gp; {
    gp->gd_looffset = (int)func;
    gp->gd_selector = GSEL(GCODE_SEL,SEL_KPL);
    gp->gd_stkcpy = 0;
    gp->gd_xx = 0;
    gp->gd_type = typ;
    gp->gd_dpl = dpl;
    gp->gd_p = 1; /* definitely present */
    gp->gd_hioffset = ((int) func)>>16;
}

/* ASM entry points to exception/trap/interrupt entry stub code. */
#define IDTVEC(name) X##name

extern
    IDTVEC(div), IDTVEC(dbg), IDTVEC(nmi), IDTVEC(bpt), IDTVEC(ofl),
    IDTVEC(bnd), IDTVEC(ill), IDTVEC(dna), IDTVEC(dble), IDTVEC(fpusegm),
    IDTVEC(tss), IDTVEC(missing), IDTVEC(stk), IDTVEC(prot), IDTVEC(page),
    IDTVEC(rsvd), IDTVEC(fpu), IDTVEC(rsvd0), IDTVEC(rsvd1), IDTVEC(rsvd2),
    IDTVEC(rsvd3), IDTVEC(rsvd4), IDTVEC(rsvd5), IDTVEC(rsvd6),
    IDTVEC(rsvd7), IDTVEC(rsvd8), IDTVEC(rsvd9), IDTVEC(rsvd10),
    IDTVEC(rsvd11), IDTVEC(rsvd12), IDTVEC(rsvd13), IDTVEC(rsvd14),
    IDTVEC(intr0), IDTVEC(intr1), IDTVEC(intr2),
    IDTVEC(intr3), IDTVEC(intr4), IDTVEC(intr5), IDTVEC(intr6),
    IDTVEC(intr7), IDTVEC(intr8), IDTVEC(intr9), IDTVEC(intr10),
    IDTVEC(intr11), IDTVEC(intr12), IDTVEC(intr13), IDTVEC(intr14),
    IDTVEC(intr15), IDTVEC(syscall);
init386() { 


    /* exceptions */
    setgate(idt+0, &IDTVEC(div), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+1, &IDTVEC(dbg), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+2, &IDTVEC(nmi), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+3, &IDTVEC(bpt), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+4, &IDTVEC(ofl), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+5, &IDTVEC(bnd), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+6, &IDTVEC(ill), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+7, &IDTVEC(dna), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+8, &IDTVEC(dble), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+9, &IDTVEC(fpusegm), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+10, &IDTVEC(tss), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+11, &IDTVEC(missing), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+12, &IDTVEC(stk), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+13, &IDTVEC(prot), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+14, &IDTVEC(page), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+15, &IDTVEC(rsvd), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+16, &IDTVEC(fpu), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+17, &IDTVEC(rsvd0), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+18, &IDTVEC(rsvd1), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+19, &IDTVEC(rsvd2), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+20, &IDTVEC(rsvd3), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+21, &IDTVEC(rsvd4), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+22, &IDTVEC(rsvd5), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+23, &IDTVEC(rsvd6), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+24, &IDTVEC(rsvd7), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+25, &IDTVEC(rsvd8), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+26, &IDTVEC(rsvd9), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+27, &IDTVEC(rsvd10), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+28, &IDTVEC(rsvd11), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+29, &IDTVEC(rsvd12), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+30, &IDTVEC(rsvd13), SDT_SYS386TGT, SEL_KPL);
    setgate(idt+31, &IDTVEC(rsvd14), SDT_SYS386TGT, SEL_KPL);

    /* first icu */
    setgate(idt+32, &IDTVEC(intr0), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+33, &IDTVEC(intr1), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+34, &IDTVEC(intr2), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+35, &IDTVEC(intr3), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+36, &IDTVEC(intr4), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+37, &IDTVEC(intr5), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+38, &IDTVEC(intr6), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+39, &IDTVEC(intr7), SDT_SYS386IGT, SEL_KPL);

    /* second icu */
    setgate(idt+40, &IDTVEC(intr8), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+41, &IDTVEC(intr9), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+42, &IDTVEC(intr10), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+43, &IDTVEC(intr11), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+44, &IDTVEC(intr12), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+45, &IDTVEC(intr13), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+46, &IDTVEC(intr14), SDT_SYS386IGT, SEL_KPL);
    setgate(idt+47, &IDTVEC(intr15), SDT_SYS386IGT, SEL_KPL);

printf("lidt\n"); getchar(); 
lidt(idt, sizeof(idt)-1); 

/* [excerpted from srt.s] */ 

/* lidt(*idt, nidt) */ 

.globl _lidt 
idesc: .word 0 

.long 0 
_lidt: 

movl 4(%esp),%eax 
movl %eax,idesc+2 





movl 8(%esp),%eax 
movw %ax,idesc 
lidt idesc 
ret 
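
setgate() splits the 32-bit handler address across the two ends of the gate descriptor, with the selector, type, privilege level, and present bit in between. The sketch below shows the same packing as explicit shifts on two 32-bit words, which can be easier to check than bitfields. It assumes the low-bit-first field order of `struct gate_descriptor`; `struct gate_words` and `pack_gate` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* The two 32-bit halves of an 8-byte gate descriptor. */
struct gate_words {
    uint32_t lo;    /* gd_looffset | gd_selector << 16 */
    uint32_t hi;    /* type, dpl, present bit, gd_hioffset << 16 */
};

/* Pack a gate with gd_stkcpy = 0 and gd_p = 1, as setgate() does. */
struct gate_words pack_gate(uint32_t offset, uint16_t selector,
                            unsigned type, unsigned dpl)
{
    struct gate_words g;

    assert(type < 32 && dpl < 4);
    g.lo = ((uint32_t)selector << 16) | (offset & 0xffff);
    g.hi = (offset & 0xffff0000u)   /* high half of handler address */
         | (1u << 15)               /* present */
         | ((uint32_t)dpl << 13)    /* descriptor priority level */
         | ((uint32_t)type << 8);   /* 5-bit gate type */
    return g;
}
```

A kernel-only trap gate (type 15, privilege 0) through selector 0x08 therefore has 0x8f in its access byte, while the user-callable 386 call gate (type 12, privilege 3) has 0xec.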

[LISTING NINE] 

/* [excerpted from i386.c] */ 

#define NBPG 4096 /* number of bytes per page */ 

#define PG_V 0x00000001 /* mark this page as valid */
#define PG_UW 0x00000006 /* user and supervisor writable */

int lcr0(), lcr3();

init386() {
    /* bag of bytes to put page table, page directory in */
    static char bag[(1+1+1)*NBPG];
    int *ppte, *pptd, *cr3, x;

    /* make page table & directory aligned to NBPG */
    ppte = (int *) (((int) bag + NBPG-1) & ~(NBPG-1));
    cr3 = pptd = ppte + 1024;

    /* page table directory only has lowest 4MB entry mapped */
    *pptd++ = (int) ppte + (PG_V|PG_UW);
    for (x = 1; x < 1024 ; x++,pptd++) *pptd = 0;

    /* page table, all entries virtual == real, user/supervisor r/w */
    for (x = 0; x < 1024 ; x++,ppte++) *ppte = x*NBPG + (PG_V|PG_UW);

    /* turn on paging */
    lcr3(cr3);
    printf("paging"); getchar();
    lcr0(0x80000001);


/* [excerpted from srt.s] */ 
/* 

* lcr3(cr3) 

*/ 

.globl _lcr3 
_lcr3: 

movl 4(%esp),%eax 
movl %eax,%cr3 
ret 

/* lcr0(cr0) */

.globl _lcr0
_lcr0:

movl 4(%esp),%eax 
movl %eax,%cr0 
ret 
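
Two small calculations carry this listing: rounding an address up to a page boundary, and filling a page table so that every virtual page maps to the physical page of the same number. The sketch below isolates both in hosted C so they can be checked without touching CR3. The helper names `page_round_up` and `identity_map_4mb` are hypothetical; NBPG, PG_V, and PG_UW follow the listing.

```c
#include <assert.h>
#include <stdint.h>

#define NBPG   4096u         /* number of bytes per page */
#define PG_V   0x00000001u   /* page valid */
#define PG_UW  0x00000006u   /* user and supervisor writable */

/* Round addr up to the next NBPG boundary (NBPG must be a power of two). */
uintptr_t page_round_up(uintptr_t addr)
{
    assert((NBPG & (NBPG - 1)) == 0);
    return (addr + NBPG - 1) & ~(uintptr_t)(NBPG - 1);
}

/* Fill one page table so that virtual == real for the lowest 4MB. */
void identity_map_4mb(uint32_t pte[1024])
{
    for (uint32_t x = 0; x < 1024; x++)
        pte[x] = x * NBPG | PG_V | PG_UW;
}
```

The mask must be the bitwise complement `~(NBPG-1)`; an arithmetic negation, `-(NBPG-1)`, leaves the low bit set and does not produce an aligned address.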





[LISTING TEN] 

/* [excerpted from i386.c] */ 
init386(){ 

/* make a initial tss so 386 can get interrupt stack on syscall! */ 

    tss[0].tss_esp0 = (int) &x - 4096;
    tss[0].tss_ss0 = GSEL(GDATA_SEL, SEL_KPL);
    tss[0].tss_cr3 = (int) cr3;

    printf("ltr "); getchar();
    ltr(GSEL(GPROC0_SEL, SEL_KPL));

    printf("resume() "); getchar();

    /* set busy type to avail */
    gdt[GPROC0_SEL].sd.sd_type = SDT_SYS386TSS;

    /* jump to self to fill out tss, like BSD resume() */
    jmptss(GSEL(GPROC0_SEL, SEL_KPL));

# excerpted from srt.s

/* jmptss(sel) -- Jump to TSS so that we can load/unload context */
.globl _jmptss /* similar to BSD swtch()/resume() */
_jmptss:
    ljmp 0(%esp) /* ljmp tss */

/* saved pc points here */ 

ret 

[LISTING ELEVEN] 

/* [excerpted from i386.c] */ 
test386(){ 

/* test handling exceptions */ 
printf("breakpoint "); getchar(); 
asm (" int $3 "); 

/* Trap exception processing code */ 

trap(es, ds, edi, esi, ebp, dummy, ebx, edx, ecx, eax, 
fault, ec, eip, cs, eflags, esp, ss) { 

printf("pc:%x cs:%x ds:%x eflags:%x ec %x fault %x cr0 %x cr2 %x \n",
eip, cs, ds, eflags, ec, fault, rcr0(), rcr2());
printf("edi %x esi %x ebp %x ebx %x edx %x ecx %x eax %x\n", 
edi, esi, ebp, ebx, edx, ecx, eax) ; 
eip++; /* simple way to 'jump' over fault */ 
getchar(); 





[LISTING TWELVE] 


# excerpted from srt.s
#include

#define IDTVEC(name) .align 4; .globl _X##name; _X##name:

/* Trap and fault vector routines */
#define TRAP(a) pushl $##a ; jmp alltraps

IDTVEC(div)
    pushl $0; TRAP(T_DIVIDE)
IDTVEC(dbg)
    pushl $0; TRAP(T_DEBUG)
IDTVEC(nmi)
    pushl $0; TRAP(T_NMI)
IDTVEC(bpt)
    pushl $0; TRAP(T_BPTFLT)
IDTVEC(ofl)
    pushl $0; TRAP(T_OFLOW)
IDTVEC(bnd)
    pushl $0; TRAP(T_BOUND)
IDTVEC(ill)
    pushl $0; TRAP(T_PRIVINFLT)
IDTVEC(dna)
    pushl $0; TRAP(T_DNA)
IDTVEC(dble)
    TRAP(T_DOUBLEFLT)
IDTVEC(fpusegm)
    pushl $0; TRAP(T_FPOPFLT)
IDTVEC(tss)
    TRAP(T_TSSFLT)
IDTVEC(missing)
    TRAP(T_SEGNPFLT)
IDTVEC(stk)
    TRAP(T_STKFLT)
IDTVEC(prot)
    TRAP(T_PROTFLT)
IDTVEC(page)
    TRAP(T_PAGEFLT)
IDTVEC(rsvd)
    pushl $0; TRAP(T_RESERVED)
IDTVEC(fpu)
    pushl $0; TRAP(T_ARITHTRAP)
/* 17 - 31 reserved for future exp */
IDTVEC(rsvd0)
    pushl $0; TRAP(17)
IDTVEC(rsvd1)
    pushl $0; TRAP(18)
IDTVEC(rsvd2)
    pushl $0; TRAP(19)
IDTVEC(rsvd3)
    pushl $0; TRAP(20)
IDTVEC(rsvd4)
    pushl $0; TRAP(21)
IDTVEC(rsvd5)
    pushl $0; TRAP(22)
IDTVEC(rsvd6)
    pushl $0; TRAP(23)
IDTVEC(rsvd7)
    pushl $0; TRAP(24)
IDTVEC(rsvd8)
    pushl $0; TRAP(25)
IDTVEC(rsvd9)
    pushl $0; TRAP(26)
IDTVEC(rsvd10)
    pushl $0; TRAP(27)
IDTVEC(rsvd11)
    pushl $0; TRAP(28)
IDTVEC(rsvd12)
    pushl $0; TRAP(29)
IDTVEC(rsvd13)
    pushl $0; TRAP(30)
IDTVEC(rsvd14)
    pushl $0; TRAP(31)


alltraps: 
pushal 

push %ds # save old selector's we will use 

push %es 

movw $0x10,%ax # load them with kernel global data sel 

movw %ax,%ds 

movw %ax,%es 

call _trap 

pop %es 

pop %ds 

popal 

addl $8,%esp # pop type, code 

iret 


/* [excerpted from i386.c] */ 


init386() { 

    outb(0xf1,0); /* clear coprocessor to cover all bases */

    /* initialize 8259 ICU's in preparation for device interrupts */
    outb(ICU1,0x11);    /* reset the unit */
    outb(ICU1+1,32);    /* start with idt 32 */
    outb(ICU1+1,4);     /* master please */
    outb(ICU1+1,1);
    outb(ICU1+1,0xff);  /* all disabled */

    outb(ICU2,0x11);
    outb(ICU2+1,40);    /* start with idt 40 */
    outb(ICU2+1,2);     /* just a slave */
    outb(ICU2+1,1);
    outb(ICU2+1,0xff);  /* all disabled */

    /* initialize 8253 timer on interrupt #0 */
    /* the 8253 input clock on the PC is 1,193,182 Hz; divide for 60 Hz */
    outb(0x43, 0x36);
    outb(0x40, 1193182/60);         /* low byte of divisor */
    outb(0x40, (1193182/60)/256);   /* high byte of divisor */

} 
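
The timer setup writes a 16-bit divisor to the 8253 one byte at a time, low byte first. The sketch below works through that arithmetic, assuming the standard PC timer input clock of 1,193,182 Hz; `PIT_HZ`, `pit_divisor`, `pit_lo`, and `pit_hi` are hypothetical names, not part of the listing.

```c
#include <assert.h>

#define PIT_HZ 1193182u   /* assumed 8253 input clock on the PC, in Hz */

/* Divisor that makes the timer tick hz times per second. */
unsigned pit_divisor(unsigned hz)
{
    unsigned d;

    assert(hz > 0);
    d = PIT_HZ / hz;
    assert(d <= 0xffff);   /* the counter is only 16 bits wide */
    return d;
}

/* Low and high bytes, in the order they are written to port 0x40. */
unsigned char pit_lo(unsigned d) { return d & 0xff; }
unsigned char pit_hi(unsigned d) { return (d >> 8) & 0xff; }
```

For a 60 Hz clock the divisor is 19886 (0x4dae), so the two bytes written are 0xae and then 0x4d. Rates below about 19 Hz would overflow the 16-bit counter.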

test386 (){ 

/* test interrupts for a while */ 
printf("inton"); getchar(); 

outb(ICU1+1,0); /* unmask all interrupts */ 

outb(ICU2+1,0); 
inton(); 

timeout = 0x8000000; 

do nothing(); while (timeout--);

intoff (); 


# excerpted from srt.s 

#define INTR(a) \
pushal ; \ 
push %ds ; \ 

push %es ; \ 

movw $0x10, %ax ; \ 

movw %ax, %ds ; \ 

movw %ax,%es ; \ 

pushl $##a ; \ 

call _intr ; \ 
pop %eax ; \ 
pop %es ; \ 
pop %ds ; \ 
popal ; \ 
iret 

/* hardware 32 - 47 */ 
IDTVEC(intr0)
    INTR(0)
IDTVEC(intr1)
    INTR(1)
IDTVEC(intr2)
    INTR(2)
IDTVEC(intr3)
    INTR(3)
IDTVEC(intr4)
    INTR(4)
IDTVEC(intr5)
    INTR(5)
IDTVEC(intr6)
    INTR(6)
IDTVEC(intr7)
    INTR(7)
IDTVEC(intr8)
    INTR(8)
IDTVEC(intr9)
    INTR(9)
IDTVEC(intr10)
    INTR(10)
IDTVEC(intr11)
    INTR(11)
IDTVEC(intr12)
    INTR(12)
IDTVEC(intr13)
    INTR(13)
IDTVEC(intr14)
    INTR(14)
IDTVEC(intr15)
    INTR(15)

.globl _inton 
_inton: 
sti 
ret 

.globl _intoff 
_intoff: 
cli 
ret 


/* back to i386.c */ 

/* Interrupt vector processing code */ 
intr(ivec) {
    static clk;
    int omsk1, omsk2;

    /* mask off interrupt being serviced, save old mask */
    if (ivec > 7) {
        omsk2 = inb(ICU2+1);
        outb(ICU2+1, 1<<(ivec-8));
    } else {
        omsk1 = inb(ICU1+1);
        outb(ICU1+1, 1<<ivec);
    }

    /* ... */

    if (ivec > 7) {
        outb(ICU2+1,omsk2);
        outb(ICU2,0x20);
    } else
        outb(ICU1+1,omsk1);
    outb(ICU1,0x20);

    /* return to interrupt stub */
}
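
The interrupt code relies on a fixed relationship between an interrupt line, the controller that owns it, the mask bit within that controller, and the IDT slot its gate occupies. The sketch below writes that relationship out explicitly, assuming the vector bases 32 and 40 programmed into the two 8259s in init386(); the helper names are hypothetical.

```c
#include <assert.h>

#define ICU1_BASE 32   /* first 8259 starts at idt 32 */

/* IDT vector used for interrupt line irq (0-15). The second 8259
 * starts at idt 40, so the sixteen vectors are contiguous. */
int irq_vector(int irq)
{
    assert(irq >= 0 && irq <= 15);
    return ICU1_BASE + irq;
}

/* Nonzero if the line belongs to the second (slave) controller. */
int irq_is_slave(int irq)
{
    return irq > 7;
}

/* Bit for this line within its own controller's mask register. */
unsigned char irq_mask_bit(int irq)
{
    return (unsigned char)(1u << (irq & 7));
}
```

Interrupt 10, for example, arrives through IDT slot 42 and corresponds to bit 2 (value 4) of the second controller's mask register.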





[LISTING THIRTEEN] 


test386 () { 

/* transfer to user mode to test system call */ 
printf("touser "); getchar(); 

touser(LSEL(LUCODE_SEL,SEL_UPL), LSEL(LUDATA_SEL,SEL_UPL), &usercode);

# [excerpted from srt.s]

/* touser (cs,ds,func) */ 

.globl _touser 
_touser: 
pushal 

movl %esp,_myspback 
movl 32+4(%esp),%eax 
movl 32+8(%esp),%edx 
movl 32+12(%esp),%ecx

# build outer stack frame 
pushl %edx # user ss 

pushl %esp # user esp 

pushl %eax # cs 

pushl %ecx # ip 

movw %dx,%ds 

movw %dx,%es 

lret # goto user!

/* code to execute in user mode */ 

.globl _usercode 

#define LCALL(x,y) .byte 0x9a ; .long y; .word x 
_usercode: 

LCALL(0x7,0x0) /* would be lcall $0x7,0x0 except for assembler bug */ 

IDTVEC(syscall) 
pushal 

movw $0x10,%ax 
movw %ax,%ds 

movw %ax,%es 

call _syscall 

movl _myspback,%esp /* non-local goto touser() exit */ 

popal 

ret 

/* back to i386.c */ 

/* System call processing */ 
syscall () { 

printf("syscall\n"); 

} 

[LISTING FOURTEEN] 

/* trap.h: i386 trap type index [as they intersect with other BSD systems] */ 

#define T_PRIVINFLT 1 /* privileged instruction */ 

#define T_BPTFLT 3 /* breakpoint instruction */
#define T_ARITHTRAP 6 /* arithmetic trap */
#define T_PROTFLT 9 /* protection fault */
#define T_PAGEFLT 12 /* page fault */
#define T_DIVIDE 18 /* integer divide fault */
#define T_NMI 19 /* non-maskable trap */
#define T_OFLOW 20 /* overflow trap */
#define T_BOUND 21 /* bound instruction fault */
#define T_DNA 22 /* device not available fault */
#define T_DOUBLEFLT 23 /* double fault */
#define T_FPOPFLT 24 /* fp coprocessor operand fetch fault */
#define T_TSSFLT 25 /* invalid tss fault */
#define T_SEGNPFLT 26 /* segment not present fault */
#define T_STKFLT 27 /* stack fault */
#define T_RESERVED 28 /* reserved fault base */

[LISTING FIFTEEN] 

/* [excerpted from i386.c] */ 
test386(){ 

int x, *pi, timeout; 

/* generate a page fault exception */ 
printf("dopagflt\n"); getchar(); 
pi = (int *) 0x800000; /* above 4MB */ 

x = *pi; /* will fault invalid read */ 

*pi = ++x ; /* will fault invalid write */ 


Porting Unix To The 386: Language Tools Cross Support 

Developing the initial cross-tool utilities to bootstrap 386BSD. 

Developing the initial utilities 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, which was the first virtual memory microprocessor-based UNIX system. 
Prior to establishing TeleMuse, a market research firm, Lynne was vice president of marketing at 
Symmetric Computer Systems. Bill and Lynne conduct seminars on BSD, ISDN and TCP/IP. Send e-mail 
questions or comments to lynne@berkeley.edu. Copyright (c) 1991 TeleMuse. 

We stated last month that "Projects of great complexity are always uncertain" and then we went on to 
develop our standalone system. Now we must examine what we accomplished. Recall that last month, we 
started with an empty 386 residing in protected mode without one shred of reliable code: just three little 
PC utilities to facilitate software loading and bootstrap operation. Using our protected-mode program 
loader, we created a minimal 80386 protected-mode standalone C programming environment for operating 
systems kernel development work. Then we wrote prototype code for various kernel hardware support 
facilities. Finally, we used our standalone programming environment as a testbed to shake out the bugs in 
our first-cut implementation of kernel 386 machine-dependent code in preparation for incorporation in the 
BSD kernel. Following our specification methodology, we created a suitable standalone system and 
conquered a number of latent software bugs and misconceptions. 

With our standalone system, we have essentially established the "base camp" on our 386 expedition. We 
now possess much of the "gear" (utilities, compiler and assembler, and other equipment) required for such 
an adventure, but we must check it out and test it prior to actual use. As any good mountaineer knows, 
thorough knowledge of your equipment could save your life. In this case, an adherence to appropriate 
testing and coding procedures could save a project. 

As we stated earlier, the standalone system can be viewed at this stage as if it were the kernel itself, with 
the extensions as the basis of our prototype kernel code. We now continue up the base of the mountain, 
furthering our initial utilities development through the creation of a stable cross-tools environment. 

Why Develop Cross-Tools? 

We have mentioned little about our protected-mode software generation mechanism in previous articles. 
In this article, we describe the set of tools that allows us to port 386BSD. Since we don’t have 386BSD to 
generate 386BSD (yet), we must use another UNIX host to run the tools and generate protected-mode 
software; this "cross" mode operation is part of the means by which we bootstrap 386BSD. In our case, the 
cross-host that runs the software generating 386 code isn’t even a 386! 

Because the computer we use to generate the software is not the one that runs it, we will need a means to 
load files and programs over Ethernet and serial lines to the target 386 system. We will then focus on 
proving GCC itself valid for cross-support purposes. The mechanisms used for this "first assault" will be 
of great importance until we have developed a stable native environment. Careful preparation in this area 
will allow us to weather the blinding "blizzards" of bugs which will inevitably arise on our way to the top. 

386BSD Cross-Tools Goals 

A proper evaluation of our cross-tools was crucial to the successful generation of the earliest version of 
386BSD—before the system had the ability to generate its own binaries. While everyone always wants to 
use the very best tools possible in all cases, we decided that what we wanted from our cross-tools was 
simply to be able to generate enough of an operational BSD kernel and utilities to run our language tools 
in a native environment. Ultimately, we want to use native tools because they are more convenient, have a 
shorter "compile-edit-debug" cycle, are easier to support (for example, just one architecture to worry 
about), and use much of the traditional program development aids provided in BSD UNIX. 

In a nutshell, BSD, like most UNIX systems, expects to be developed in a native environment. 

As such, our principal concern at this stage is with correctness, not optimization. Performance 
considerations arise only after we achieve an operational system that can be refined using traditional 
means. This first "bootstrap" version of utilities and kernel is compromised in areas where our 
cross-support mechanisms are weakest. However, if the compromises are carefully selected, we can 
jettison these compromised areas when we "go native." 






Both the kernel and early utilities are predominantly written in C, with some assembler support. Before a 
self-supporting kernel exists, approximately 250,000 lines of C code must be made operational via the 
cross-support. Discovering compiler bugs, or cross-support-induced bugs, along the way is almost certain. 

What’s in the Tool Chest? 

Our tool chest of cross-tools consists of the following: 

The C Compiler: The bulk of our effort is organized around the C compiler. For the 386BSD project, we 
relied upon the Free Software Foundation’s (FSF) GCC compiler, version 1.34. At the beginning of this 
port, we had little familiarity with the strengths and weaknesses of GCC. We were also uncertain about its 
usefulness as an operating systems development tool, as it appeared primarily alongside other 386 C 
compilers on extant System 5 UNIX systems. Unfortunately, we cannot supply written code fragment 
examples from GCC (or any other FSF software) in this article due to constraints of the "copyleft" (see the 
accompanying text box entitled "Brief Notes: Copyrights, Copylefts, and Competitive Advantage"). 

386 Protected-mode Assembler: The remaining 386 code, particularly the code used for interfacing to 
non-C mechanisms and data structures needed to support i386 and ISA hardware functionality, was 
written in assembly language. The FSF’s GAS assembler was used for this purpose, more out of need than 
preference. The great majority of problems we encountered with the port were traced to "hidden surprises" 
and "features" in GAS, which we bypassed with clever use of inline code and other contrivances. GAS is 
functional and proven, if not pretty. 

Linker-Loader: Object modules created by GAS were linked together by an object module linkage editor. 
We had a wide variety of candidates available from BSD, FSF, and others from which to choose. 
However, because the object file format exactly matched the arrangement of our cross-host (a National 
Series 32000 machine), we put off the ultimate decision by using our cross-host’s native UNIX ld 
command. This worked without modification to our satisfaction. 

Communications and File Transfer: We needed a way to get programs and files created on our cross-host 
transferred to our 386 PC. Many cross-host to PC communications programs are available, and we settled 
on Kermit and NCSA Telnet (ftp) to do the job. 

Protected-mode Loader: Once we had transferred the programs to the PC, we used our protected-mode 
loader program (see "Three Initial PC Utilities" in DDJ, February 1991 ) to load the programs and execute 
them in 386 protected mode. 

Ancillary Tools: In addition to the heavy hitters, various minor commands are also needed to create and 
organize the object libraries. Commands such as ar, ranlib, nm, and lorder were required. Again, like the ld 
command above, we were able to use the cross-host’s native commands due to the identical executable 
format and byte order of cross-host and our 386. 

In addition to these programs, our cross-support facility must have the following data objects present to 
build kernel and utilities: 

Object Libraries: The standalone system (libsa.a) and utilities (libc.a and others) make great use of their 
respective library calls. These libraries satisfy, on the average, a few hundred of the function entry and 
data structure references invoked by various BSD utility programs. Most of the machine-dependent 
portions of BSD utilities are located in the libraries, so the majority of effort expended in porting the 
utilities is focused on the libraries. Over the course of the 386BSD project, we wrote the 
machine-dependent code into the libraries to get a given utility operational only as needed, rather than 
writing it all at once. Incremental coding provided a tactical advantage, because by the time we needed to 
wrestle with the most difficult code, we had quite a bit of seasoned experience with the 386. 

Include Files: In addition to object libraries, we must provide a complete set of include files for use with 
our cross-support package. A simple approach might be to have all references to include files directed to a 
separate i386 include directory, but this would interfere with the pathnames invoked by a variety of 
makefiles and shell scripts, not to mention all the embedded references in the source code itself. After 
finding over a hundred references to absolute pathnames, with no end in sight, we gave up on this 
approach and did the unspeakable—put into place on the cross-host all 386 include files. By virtue of the 
shell commands to386 and back2normal, we could switch our cross-host back and forth in this manner. 
Thank goodness, no other users needed to compile native programs at the time; they would have been 
somewhat surprised! 

Cross-Support Methodology 

We can employ several standard methods to aid in our cross-support effort: regression testing, divide and 
conquer, consistency checks, and defeating optimizations. 

Regression testing is used to probe for the presence of induced bugs in every step along the way to proving 
our cross-tools. Prior to creating our cross-compiler, we generate our early test files from a known-good, 
tested implementation (in the case of 386BSD, a Sequent 386 UNIX system). The compiler output for 
some unmodified portions of the compiler and the kernel of the operating system is kept as reference 
assembly language files, for comparison against the output of subsequent compiler versions compiling the 
same files. An induced error would cause a difference to show up in the comparison of the two. As an example 
of this, a whole group of instructions might be missing, signifying a dropped expression left uncompiled 
by a buggy compiler. In a similar fashion, a group of object files from the assembler are also created to 
compare with those created by the assembler on our 386. 

In addition to this set of test files, a record is kept of every kind of induced bug and the source code which 
generated it. Thus, common bugs which are inadvertently reintroduced periodically can be caught without 
needing to be debugged a second time (or a third ...). 
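
The reference-file comparison described above can be sketched in a few lines of C. This is an illustration only, not the tooling we actually used: the function name first_diff_line is hypothetical, and it compares two in-memory copies of compiler output rather than files on disk.

```c
/*
 * Sketch of a regression comparison: "ref" is the saved reference
 * assembly output, "out" is the output of a newer compiler build
 * for the same source file.
 *
 * Returns 0 if the two texts are identical, otherwise the 1-based
 * number of the first line on which they differ.
 */
int first_diff_line(const char *ref, const char *out)
{
    int line = 1;

    /* Walk both texts while they agree, counting newlines. */
    while (*ref != '\0' && *ref == *out) {
        if (*ref == '\n')
            line++;
        ref++;
        out++;
    }

    /* Both ended together: identical. Otherwise report the line. */
    return (*ref == *out) ? 0 : line;
}
```

For instance, comparing "movl a\naddl b\nret\n" against "movl a\nret\n" reports line 2, the point at which a dropped instruction would first show up.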

This mechanism for tracking compiler bugs is not a panacea—it is vulnerable to error in two major ways: It 
does nothing to aid detection of "latent" bugs in the "good" version we started with; and it becomes useless 
if modifications to the compiler result in widespread changes in the output code, thus obscuring "bug" 
changes. However, it proved adequate for the short period (one to two months) it took to reliably compile 
code in native 386BSD. 

"Divide and conquer" is used to isolate the effect of multiple bugs appearing as a single impossible-to-find 
bug. It is a very powerful tool for use in certain unpleasant predicaments. For example, during the 
386BSD project, we detected the presence of a kernel bug, a compiler bug, and a library bug all hitting at 
the same point, at a time when we did not yet have an operable debugger to sort out the mess. After 
isolating the problem with blitheringly primitive printfs, we tried porting similar, related programs, until 
we found a program that isolated the library bug and the compiler bug at separate times. Once we fixed 
these bugs, we recompiled the entire set of kernel and applications programs. The remaining kernel bug 
was then obvious to see and correct. Divide and conquer allowed us to solve an "unsolvable" problem. 

Consistency checks are implemented in the drivers and trap/system call handlers to detect "impossible" 
conditions, such as returning to a user program with interrupts off, a completely invalid user stack pointer, 
and so forth. At one point, we even had them in library code and inline to the C compiler's assembly 
language output. Throughout the 386BSD development cycle, consistency checks provided a mechanism 
to detect a problem before it became terminal and untraceable. For example, when we converted 386BSD 
from 4.3BSD-Tahoe to 4.3BSD-Reno, consistency checks detected a disastrous problem caused by a side 
effect of the context-switch code. Consistency checks have their downside, however. Performance 
degrades with the use of consistency checks in speed-sensitive areas such as system call handling. Resist 
temptation, however, and don’t take them out just for convenience. Otherwise, mysterious problems will 
reappear and drive you crazy. 
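
A consistency check of the kind described above might look like the following minimal sketch. Everything named here is an assumption made for illustration: the function sk_check_return_to_user, the constant SK_USER_TOP (an arbitrary stand-in for the top of user address space), and the choice to return an error code rather than panic. Only the interrupt-enable bit (bit 9, 0x200, of the 386 EFLAGS register) is taken from the processor definition.

```c
/* Hypothetical names throughout; this is not the 386BSD code. */
#define SK_EFLAGS_IF  0x00000200UL  /* 386 EFLAGS interrupt-enable bit */
#define SK_USER_TOP   0xFE000000UL  /* assumed upper bound of user space */

/*
 * Called just before returning to user mode with the saved user
 * EFLAGS and stack pointer. Returns 0 if the state looks sane,
 * or a nonzero code naming the "impossible" condition detected.
 * A kernel would more likely panic or log at this point.
 */
int sk_check_return_to_user(unsigned long eflags, unsigned long esp)
{
    if ((eflags & SK_EFLAGS_IF) == 0)
        return 1;   /* about to run user code with interrupts off */
    if (esp == 0 || esp >= SK_USER_TOP)
        return 2;   /* user stack pointer is outside user space */
    return 0;
}
```

The two tests cost only a handful of instructions per system call, which is the performance price the text refers to.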

Another type of seemingly benign tinkering which results in disaster comes when one tries various 
performance optimizations too early in the game. We ran into problems every time we tried jumping ahead 
by improving our early development code before it was fully reliable. It is better to "comment out" 
performance improvements, compiler optimization, and "short circuit" code evaluation, until the code and 
compiler are somewhat shopworn. It is very frustrating when you have found a mechanism for a section of 
code that might improve performance by an order of magnitude or more, but only at the risk of upsetting 
the kernel operation itself. Be wary of such improvements—patience is definitely a virtue in a systems 
project. 

Which C Standard? 

In the early days of Berkeley UNIX (pre-Version 6), C was not yet standardized. For example, types such 
as "unsigned" did not even exist—instead, arithmetic was done on "char*" types. Partly as a result of early 
portability experiments, Bell Labs eventually revised C to conform to a definition devised by Brian 
Kernighan and Dennis Ritchie (K&R), two Bell Labs scientists. Their book, The C Programming 
Language (Prentice-Hall, 1978), defined what C was for almost the next ten years. Berkeley then adopted 
this new "standard" for all related prior code and all new code when it began to put a serious effort into 
developing new UNIX functionality. As the use of C has grown, its popularity has necessitated the 
evolution and solidification of an ANSI specification of the language and its semantics. Pre-K&R 
adherents to C, ideological to a fault, have frequently found much amusement in this obsession with 
standards. After all, they originally had to fight management and funding group opposition to its use 
(partly on the grounds of "standardization") in many major projects for which it was well suited, and had 
to live with the barrage of Fortran, Pascal, and then Ada efforts to displace C as the preeminent systems 
programming language of the day. Perhaps those groups might finally agree that C will be around for yet a 
few years to come! 

What does this have to do with 386BSD? Plenty! It seems that some believe it is time to move BSD, 
kicking and screaming, into the ANSI C world, but others are still adherents of the K&R viewpoint. Since 
the K&R portable C compiler is still used for slowly dying architectures and is yet a force to be reckoned 
with, 386BSD must find a median solution. 386BSD has an eye towards the future, however, so a 
concerted effort has been made with 386-dependent code to work within the new ANSI C format, while 
remaining compatible with K&R C in common code by virtue of #ifdefs. 
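
As a rough sketch of the #ifdef technique, the fragment below lets one source file compile under either an ANSI or a K&R compiler. The macro SK_PROTO and the function sk_add are hypothetical names invented for this example; the only thing assumed of the compiler is that an ANSI compiler defines __STDC__.

```c
/*
 * Under ANSI C, SK_PROTO((int a, int b)) expands to a full prototype.
 * Under a K&R compiler it expands to an empty parameter list, which
 * is all such a compiler accepts in a declaration.
 */
#ifdef __STDC__
#define SK_PROTO(args) args
#else
#define SK_PROTO(args) ()
#endif

int sk_add SK_PROTO((int a, int b));

#ifdef __STDC__
int sk_add(int a, int b)        /* ANSI-style definition */
#else
int sk_add(a, b)                /* K&R-style definition */
    int a, b;
#endif
{
    return a + b;
}
```

Only the declaration and the function header differ; the body is shared, which keeps the two variants from drifting apart.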





GCC attempts to remedy this conflict by providing a traditional mode, but this is inadequate to our needs. 
GCC, it turns out, is not perfectly "traditional," as it favors ANSI semantics. (This should actually be no 
surprise, as it is difficult to be complete in this regard.) As such, it is another source of "silent" bugs that 
one should be aware of because the majority of the BSD code was written to older standards. 
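
One widely documented difference between older compilers and ANSI C gives a feel for how quiet such bugs can be. This is offered as an assumed illustration, not as one of the specific problems we hit: many pre-ANSI compilers promoted an unsigned char to unsigned int ("unsigned preserving"), whereas ANSI C promotes it to int ("value preserving"). The function name below is hypothetical.

```c
/*
 * Returns 1 under ANSI promotion rules, where c becomes an int and
 * c - 2 is -1. Under an unsigned-preserving compiler, c becomes an
 * unsigned int, c - 2 wraps around to a large positive value, and
 * the comparison is false, so the function returns 0.
 */
int sk_promotes_to_signed(void)
{
    unsigned char c = 1;

    return (c - 2) < 0;
}
```

No diagnostic is issued in either case; the same source line simply takes a different branch depending on which rules the compiler follows.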

Other Cross-Support Issues 

In the area of cross-host communications, a few amusing irritants developed. When we first used Kermit 
and ordinary serial lines for the early standalone system and kernel work, the few minutes of download 
delay to MS-DOS were livable, given that the debugging time required for each cycle was usually about 
20 minutes. As we got more proficient with the 386, however, and as we reached the limits of our 
documentation on 386 features, our debug sessions became shorter than the download time. Also, 
downloading a kernel (100 to 200 Kbytes) or a filesystem (1 to 5 Mbytes) began to occur more frequently, 
thus eating up even more time. Finally, with the help of a cheap (approximately $100) Ethernet card, we 
migrated to NCSA Telnet. This change cut the download time to a more reasonable number. 

Success frequently results in its own problems; we rapidly filled our tiny 40-Mbyte drive. It became 
increasingly difficult to manage slightly different versions of utilities, and the cheap and clever tricks we 
had used to bypass some development steps were themselves becoming stumbling blocks. Because we 
were sharing the disk with MS-DOS and using MS-DOS utilities to communicate with the outside world, 
files had to fit in the MS-DOS partition. By this time, it was clear that the tenuous partnership between 
MS-DOS and BSD was drawing to an end. 

Validating GCC for Use in a Cross-Environment 

We found GCC to have many fine qualities—unfortunately, cross-support operation was not one of them. 
From its inception, GCC has traditionally been run on the host on which it was compiled, and little 
thought has been put into preserving its ability to run on a machine vastly different from that host. In 
addition, some architectures supported under GCC relied to some degree on the presence of a preexisting 
native compiler to compile GCC and parts of its own compiler support libraries. To be fair, the compiler 
itself is quite capable of compiling and supporting itself. However, as originally configured, both 
cross-support and compiler bootstrapping are not very satisfying. 

Other hurdles which we had to surmount included locating host compiler bugs upon compiling GCC. 
Unlike other compiler writers who attempt to minimize the use of arbitrary C features in their code, 
GCC’s creators revel in it. As a result, compiling GCC itself constitutes an excellent test of a compiler 
because of its rich use of the language, and the impressive demands (macros, pointer dereferencing) it 
places on the said compiler. While this style of implementation is at loggerheads with practical portability 
in our compromised "real" world, we must admit that the creators of GCC show fearless, if not reckless, 
faith in their compiler. No one else so completely exploits the C language, at the price of providing 
faultless support for such an extensive use of the language. The intellectual honesty required for such an 
implementation has received its fair portion of praise. 

In the course of attempting to qualify a cross-host, we attempted to compile GCC on many machines. One 
less than serious attempt was made to compile portions of GCC on MS-DOS using various common PC C 
compilers. As expected, we got dismal results. We found that to compile GCC on MS-DOS, we would 
have to extensively rewrite the code, and also use some manner of MS-DOS extender, an effort not 
compatible with our specification goals. We did consider using the standalone library (see "The 
Standalone System" in DDJ, March 1991) to run GCC in native mode after compiling GCC on a borrowed 
386 system elsewhere, but gave up on this when our cross-host version of GCC stabilized. We worried 
that these two PC-hosted approaches would not only require a great deal of additional work, but also 
require us to maintain them in the future for avid users. Perhaps a fate worse than death? 

Our intended cross-host, a UNIX machine, had many problems in compiling GCC, even though the 
compiler has been part of a stable production system for many years. However, consistency checks within 
GCC itself allowed us to locate the nature of the problem to within a few thousand instructions, 
whereupon we would tediously single-step to the problem with a debugger. Since we could not fix the 
cross-host’s native compiler (frequently this would mean exchanging the bug you know for the bugs 
generated by the fix that you don’t know), we mauled GCC itself and defeated portions of the compiler in 
a successful attempt to avoid code that the native compiler would mishandle. Due to the nature of the 
native compiler bug (an obscure pointer aliasing problem), this was the only way we could convince 
ourselves that we were not just migrating the bug. As you might expect from our mention above, one of 
the best tests of our then-generated cross-compiler was GCC itself. 

Another aspect of running GCC in a cross-environment is dealing with an internal support library known 
as gnulib.a. GCC is arranged so that portions of machine-dependent operations not implemented by the 
compiler itself with issued assembly code will instead be implemented by a subroutine call to a gnulib.a 
entry point. To cleverly implement these missing areas within the compiler, one creates gnulib.a by 
compiling source code encapsulating the missing feature with the native host’s compiler (not GCC), 
relying on it to implement the missing feature as it sees fit. Here’s an example. Suppose we have the C 
expression: 

if(a != b) ... 

Let’s assume the compiler does not know how to handle !=. It could generate code to call a gnulib entry 
point: 


pushl _a 
pushl _b 
call noteqsi2 


The gnulib would contain code compiled with a different native compiler than GCC, one that can deal 
with a != expression: 

noteqsi2(n,m) { 

return(n != m); 

} 

This is a sneaky way to leverage an existing native compiler to fill out voids in GCC. Surprisingly, this 
works with our cross-host in most cases. We implemented a replacement for gnulib only as needed (few 
are ever called). 





We ran into an entertaining problem when we first moved the compiler onto the 386. Because we no 
longer needed our cross-host modifications to GCC, we started recompiling the stock version of GCC, 
including gnulib, with the only compiler we had on our nascent BSD UNIX system, namely GCC. GCC 
generated code that would call the support library, which in turn would then call itself to implement the 
same support, and so on ad infinitum! This is another minor example of the lack of native support for 
GCC in the then standard release. It is expected that GCC 2.0 and later versions will better address these 
and other cross-support issues. 

GCC Support Calls to Replace GNULIB 

In addition to the normal subroutine libraries found with BSD, two support subroutines are needed. GCC 
handles all ANSI C operations by generating the appropriate 386 instructions, with the exception of 
floating point conversion to signed and unsigned integers. In Listing One (page xx), fixdfsi() manages to 
take a double precision floating point argument (a df) and turn it into a signed integer (an si, or small, 
within a machine word, integer). In Listing Two (page xx), fixunsdfsi() likewise takes a double-precision 
floating point argument and returns an unsigned (uns) integer. These functions use the 386 numeric 
processor integer truncation features to return the appropriate values. Because there is no direct method to 
convert a floating point number to unsigned format, we detect the condition (for example, above the most 
positive number possible), reduce the value prior to conversion (so it will fit into a signed value), then add 
back in what we subtracted after conversion, thus avoiding overflow. 
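
The reduce, convert, and add-back step can be sketched in portable C as follows. This is not the code of Listings One and Two, which use the 386 numeric processor directly; the sk_ prefixed names are hypothetical, the signed conversion is delegated to the host compiler's own cast, and a 32-bit int is assumed.

```c
/* Stand-in for the signed conversion: truncate a double to an int. */
static int sk_fixdfsi(double d)
{
    return (int)d;
}

/*
 * Convert a double to an unsigned 32-bit integer using only a
 * signed conversion. Values at or above 2^31 will not fit in a
 * signed int, so we subtract 2^31 first, convert, and then add
 * 2^31 back in unsigned arithmetic.
 */
unsigned int sk_fixunsdfsi(double d)
{
    const double bias = 2147483648.0;   /* 2^31 */

    if (d >= bias)
        return (unsigned int)sk_fixdfsi(d - bias) + 2147483648u;
    return (unsigned int)sk_fixdfsi(d);
}
```

With an input of 3000000000.0, the reduced value is 852516352, which converts safely as a signed integer; adding 2^31 back yields 3000000000.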

Choosing a Sensible Cross-Host 

Our ad hoc modifications of GCC resulted in a cross-compiler that would provide a considerable amount 
of language support, but it had limits. We also needed to consider the following: include file differences, 
byte sex, floating point format, inline assembly code, table generation programs, hardware page size, and 
object libraries. Some of these areas were so pervasive and important that they were primary 
considerations when we selected our cross-host. 

By selecting an appropriate cross-host, we minimized a number of problems, including compatible byte 
sex, structure data alignment, program size, and existing tool set. Floating point data format turned out to 
be a minor concern because few programs in the early utilities group require it. Thanks to the IEEE 
floating point standard, this becomes easy as most post-VAX period processors support the same format 
(modulo byte order). Obviously, our job would have been simpler if we already had 386BSD up and 
running and then had to port it, so what we looked for in a cross-host was something very similar. Oddly 
enough, a C compiler hides most of the native machine’s instruction set, so the least important part is the 
cross-host’s processor architecture. Operating system version and program development tool similarity 
count for much more. 
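
To make the byte sex concern concrete, here is a small sketch, with hypothetical function names, of how a tool running on the cross-host could check its own byte order and, where the orders differ, write 32-bit values in the 386's little-endian order regardless of the host. Choosing a cross-host with matching byte order, as we did, makes the second step unnecessary.

```c
/* Returns 1 if the host stores the low-order byte first, else 0. */
int sk_host_is_little_endian(void)
{
    unsigned int probe = 1;

    return *(unsigned char *)&probe == 1;
}

/* Store a 32-bit value in little-endian (386) order, on any host. */
void sk_put_le32(unsigned char *buf, unsigned long v)
{
    buf[0] = (unsigned char)(v & 0xff);
    buf[1] = (unsigned char)((v >> 8) & 0xff);
    buf[2] = (unsigned char)((v >> 16) & 0xff);
    buf[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Read back a 32-bit little-endian value, on any host. */
unsigned long sk_get_le32(const unsigned char *buf)
{
    return (unsigned long)buf[0]
         | ((unsigned long)buf[1] << 8)
         | ((unsigned long)buf[2] << 16)
         | ((unsigned long)buf[3] << 24);
}
```

Table generation programs and object file writers are the usual places where such explicit byte-by-byte stores would be needed on a mismatched host.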

Those more dogmatic, gutsy, or energetic might say that we simply avoided the hard parts. They are quite 
correct. What hardships we did endure in cross-tools were more than enough for us. 

Where Do We Go From Here 

Now that we have created a stable cross-tools environment, we can get on to the last of our initial 
utilities—the initial root filesystem. In our next article, we will examine the minimum requirements which 
must be met to run a UNIX system, and the interrelationships between different UNIX files and utilities 
needed during the various stages of our 386BSD port. We then create a root filesystem containing, among 
others, /etc/init, /bin/sh, /dev/console, and /bin/ls (a token program), and debug it via the standalone 
utilities. We also discuss some of the problems encountered in filesystem downloading and validation 
procedures. 

Brief Notes: Copyrights, Copylefts, and Competitive Advantage 

Usually when we discuss a piece of software, we attempt to enhance our understanding with a program or 
fragment of code which illustrates the topic. Therefore, it is quite frustrating to discuss as major a tool as 
GCC, where the code is available to anyone upon request but we are prevented by the "copyleft" from 
showing you any code fragments. As such, we feel it important to examine the history and some effects of 
the copyleft. 

The copyleft on GNU software was born out of rather turbulent circumstance. In the mid-1980s, a number 
of commercial entities made a practice of "appropriating" software developed at MIT and other 
universities and placing their own copyright on it. Richard Stallman, then (and still) at the MIT Media 
Lab, was involved with some early LISP software development, and experienced first hand the ruthless 
and bloody battle between Symbolics and LMI over LISP software enhancements. At the same time, 
AT&T was leading the forefront in the development of license agreements for UNIX, though not investing 
much at that time in the development of UNIX itself. This obvious (and still successful) locking up of 
research led Stallman and others to work on software projects which would be unencumbered by licenses, 
copyrights, and other restrictive means. Stallman's EMACS for the PDP-10 was one of the first visual 
editors available without those restrictions. 

While commendable in theory, the practice was quickly thwarted by the success of Gosling’s EMACS, a 
C-based version of Stallman’s EMACS, which ran under UNIX. As more use was made of Gosling's 
EMACS, companies began to support it, add new features, and so forth, until finally it was locked-up by 
the vendors. Of course, it goes without saying that the changes to the code and new features were not 
returned to Stallman’s group for updates, since that would have impacted a vendor’s perceived 
competitive advantage. 

Basically, the copyleft was an extreme response to the excesses of a cutthroat market. While permitting 
redistribution, the copyleft attempts to maintain access to and control of changes in code, by requiring that 
source modifications be returned to the FSF for redistribution and by demanding that the source with these 
modifications be made available from that company to anyone for essentially a "copying" fee. A liberal 
reading of the license makes it practically impossible for a company to easily lock up the software. It also 
prevents a company from easily recouping its investment in further software development, enhancements, 
or support by eliminating its competitive advantage over its competitors. A large company can avoid this 
by developing or licensing needed software tools, but a small business or individual developer does not 
have access to these resources. 

Finally, the copyleft attempts to exert control over any discussion and analysis of the code itself in any 
printed medium, and states in part: "...The ’Program,’ below, refers to any such program or work, and a 
’work based on the Program’ means either the Program or any work containing the Program or a portion 
of it, either verbatim or with modifications ...." 





Thus, according to the copyleft, a written examination of GCC, which utilizes some of the code itself for 
purposes of discussion, falls under the copyleft itself. This is a condition unacceptable to authors and 
publishers, because they make their income only from the publishing and distribution of written works, 
and not necessarily from software products. Perhaps this was an unintended side effect of the copyleft, but 
attempts to narrow it have been to no avail. 

The headlong rush towards "open standards," an oxymoron worthy of the military, is no solution either, 
but merely an effort to mask the implicit control, development, and innovation of a proprietary object by a 
vested interest by calling it "open." The only open standard is one that has an openly accessible model or 
example of the standard itself. Just as a mathematical formula in physics is meaningless without example 
problems and solutions, a standard based on a proprietary object is also meaningless without code 
solutions which justify its worthiness —and the code answer book to this open standard should not be 
subject to ransom through the use of "licensing" fees and anticompetitive product controls. Such a 
standard must also be equally accessible to those developing proprietary and nonproprietary works. This 
not only mitigates the inherent competitive disadvantage for the small innovator, but is also a disincentive 
to the development of proprietary "copycat" standards alongside the open standard, in an attempt to 
undermine its use. 


Recently, the trend at many universities and research institutions has been to permit access to 
university-developed code through simple copyright procedures which permit modification and 
redistribution with attribution. The copyright used by TeleMuse, for example, is similar to the University 
of California at Berkeley (UCB) copyright and is designed to be simple and direct; see Figure 1. 


Figure 1: The copyright used by TeleMuse in the 386BSD article series 

/* Copyright (c) date, name-of-author. All rights reserved. 

* Written by name-of-author, date-written. 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.*/ 


In addition, UCB copyrights currently prohibit use of the UCB name in products incorporating the 
software to avoid the appearance of an endorsement. 

According to Marshall Kirk McKusick, UCB CSRG research computer scientist and president of
USENIX: "We have the capitalists with their copyright and the radicals with their copyleft. We are at the
‘copycenter,’ since we allow redistribution with credit to the authors. Our goal is to have as many people
as possible use our software." In January of 1991, CMU adopted a variant of the UCB copyright for the 
MACH operating system. 

This different approach to copyright does not attempt to regulate the development and distribution of code 
as does the copyleft. Instead, software is made available with the full knowledge that it will be 
incorporated into many different projects. These projects, in turn, will ultimately enhance the international 
competitiveness of the computer industry itself, by allowing individuals and small businesses the same 
access to these development tools as large corporations. After all, it is the individual and small business 
which are the sources of innovation in our society. Anything less (including the copyleft) results in a
competitive advantage only for large companies with a vested interest in the status quo.

The Free Software Foundation deserves high praise for leading the fight against locked-up software. Some 
GNU packages, such as GCC and EMACS, have been used by small firms and research groups to develop 
innovative and unique software and products, which would not otherwise have been feasible for these 
economically strapped entities. Even 386BSD might not have been possible had we not been able to 
leverage other resources like GCC. However, as the climate in which the copyleft was developed has
moderated, we hope that the FSF will moderate its stand as well, and at the very least permit unfettered 
discussion and analysis of the code in print. We have every confidence that there will continue to be a 
flow of new software back to the source from companies, individuals and research groups. 

It is time vested interests started offering innovative and competitive works and stopped preventing
innovation through the "anticompetitive" use of copylefts, open standards, and licensing. Those who 
maintain a competitive advantage through the inappropriate use of these methods, instead of through true 
innovation, have done so at the cost of the competitiveness of the entire domestic computer industry. — L.J. 

[LISTING ONE] 


/* fixdfsi.s: Copyright (c) 1990 William Jolitz. All rights reserved.
 * Written by William Jolitz 1/90
 * Redistribution and use in source and binary forms are freely permitted
 * provided that the above copyright notice and attribution and date of work
 * and this paragraph are duplicated in all such forms.
 * THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
 * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
 * GCC compiler support function, truncates a double float into a signed long.
 */

        .globl  _fixdfsi
_fixdfsi:
        pushl   $0xe7f          /* truncate, long real, mask all */
        fnstcw  2(%esp)         /* save my old control word */
        fldcw   (%esp)          /* load truncating one */
        fldl    8(%esp)         /* load double */
        fistpl  8(%esp)         /* store back as an integer */
        fldcw   2(%esp)         /* load prior control word */
        popl    %eax
        movl    4(%esp),%eax
        ret





[LISTING TWO] 

/* fixunsdfsi.s: Copyright (c) 1990 William Jolitz. All rights reserved.
 * Written by William Jolitz 4/90
 * Redistribution and use in source and binary forms are freely permitted
 * provided that the above copyright notice and attribution and date of work
 * and this paragraph are duplicated in all such forms.
 * THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
 * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
 * GCC compiler support function, truncates a double float into unsigned long.
 */

        .globl  _fixunsdfsi
_fixunsdfsi:
        pushl   $0xe7f          /* truncate, long real, mask all */
        fnstcw  2(%esp)         /* save my old control word */
        fldcw   (%esp)          /* load truncating one */
        fldl    8(%esp)         /* argument double to accum stack */
        frndint                 /* create integer */
        fcoml   fbiggestsigned  /* bigger than biggest signed? */
        fstsw   %ax
        sahf
        jnb     1f
        fistpl  8(%esp)
        fldcw   2(%esp)         /* load prior control word */
        popl    %eax
        movl    4(%esp),%eax
        ret
1:      fsubl   fbiggestsigned  /* reduce for proper conversion */
        fistpl  8(%esp)         /* convert */
        fldcw   2(%esp)         /* load prior control word */
        popl    %eax
        movl    4(%esp),%eax
        addl    $2147483648,%eax /* restore bias of 2^31 */
        ret

fbiggestsigned: .double 0r2147483648.0  /* 2^31 */
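
For readers who do not follow 387 assembly, the strategy of Listing Two can be sketched in C. This is an illustration under stated assumptions rather than the library routine itself: the name fix_unsigned is hypothetical, the input is assumed to lie between 0 and 2^32 (exclusive), and the C cast (which truncates toward zero) stands in for the truncating control word loaded by the assembly version.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C rendering of the unsigned conversion in Listing Two.
 * Assumes 0.0 <= d < 4294967296.0; other inputs are rejected by the assert. */
uint32_t fix_unsigned(double d)
{
    const double biggest_signed = 2147483648.0;    /* 2^31 */

    assert(d >= 0.0 && d < 4294967296.0);

    if (d < biggest_signed)
        return (uint32_t)(int32_t)d;               /* fits the signed conversion */

    /* Reduce into signed range, convert, then restore the bias of 2^31. */
    int32_t reduced = (int32_t)(d - biggest_signed);
    return (uint32_t)reduced + UINT32_C(2147483648);
}
```

The two branches mirror the two exits of the listing: values below 2^31 go straight through the signed integer store, while larger values are reduced first because the signed store alone cannot represent them.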

Porting Unix To The 386: The Initial Root Filesystem 

The development of the initial root filesystem required for the 386BSD operating system kernel. 

Completing the toolset 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8BSD and 2.9BSD and was the chief architect of National
Semiconductor’s GENIX project. Lynne established TeleMuse, a market research firm specializing in the 
telecommunications and electronics industry. They can be contacted via e-mail at
lynne@berkeley.edu. Copyright (c) 1991 TeleMuse.

In previous installments of this project, we’ve tantalized you with the preliminaries to our porting project. 
We’ve discussed our initial plan of the port, a bootstrap of the system off MS-DOS, the standalone utilities 
that help us test out the basic protected mode mechanisms of the 386, and cross-tools for generating the 
BSD utility programs that will run off our BSD operating system kernel. In our analogy of climbing K2,
most of our equipment has checked out in use, and the route along the ridge to the peak looks clear with 
possible good weather. In fact, at the end of this installment, we will finally complete the preliminaries, 
and leave our base camp to start the major ascent. 


We now examine the initial root filesystem required for our 386BSD operating system kernel. Earlier in
this series, we discussed the cross-tools used to create 386BSD utilities, but we did not mention how we 
got these utilities onto our target machine. We could load them as files onto MS-DOS — unfortunately, 
386BSD has no ability (initially) to decipher the organization of files on the disk. (Some programmers 
who have spent time with the FAT, clusters, and their ilk might consider this to be more of a blessing than 
a curse.) Again, keep in mind that the primary operating system focus in this port is UNIX and not 
MS-DOS, and that we are working on a research project, not a commercial release. 

We now embark on making a usable filesystem in order to hold the programs and files used by our newly 
ported system. A filesystem is a set of data structures and functions that describe the storage of files on
some means of bulk storage. It literally is a subsystem for reading, writing, creating, and destroying
programs and data files on a medium. Some programs and data files will need to be used by our operating
system kernel immediately when it begins to run; the rest will be made accessible as the system is 
configured for use by the configuration programs, which will be run only after the system is completely 
underway. The first group of files will contain the programs that allow us to add (or mount) new 
filesystems, creating hierarchical or "tree-based" filesystems. Because trees grow from their roots, this 
filesystem will be known as the "root," or bottom-most of the filesystems. 

The kernel is the "heart" of UNIX, running programs inside of processes that it creates for that purpose, 
and satisfying program requests (system calls) as needed. Later, when we describe the formulation of the 
kernel operating system program (hereafter called the "kernel") and its initialization, we will use this
initial root filesystem. Thus, in starting our major ascent, we will begin the actual job of porting the kernel 
program. 

The Role of the Root Filesystem 

The utility programs on the root establish a primary environment to craft an arrangement of filesystems, 
introduce special systems functionality via the server processes (daemons), and configure devices for 
operation. In addition, the root filesystem possesses the utility tools used to fix, or reload if necessary, 
other filesystems. These tools are often used to fix the root itself if it is not badly damaged. By virtue of its 
small size and lack of actively modified files, the root usually survives intact when a system crash occurs. 
This is all the better for us because we need it to run the system and summarily fix all ills. Should it get 
destroyed, however, we must completely reload it by some means; that’s why some systems have "back 
up" root filesystems—just in case this actually happens. (With 386BSD, we eventually allow for root 
filesystem recovery and installation of the first root filesystem by means of a floppy root filesystem, which 
contains the tools to load the entire system over the network, via a serial port, or from a floppy or cartridge 
tape dump.) 

The root filesystem is a small but essential portion of disk storage. It provides enough functionality for the 
system to expand its resources to use storage other than the root itself, and configure operations based on 
arrangements mandated by current conditions. The root filesystem is also the starting point for all filename 
translations and path searches. As a result, a smaller root with fewer files to search through will generally 
improve file operations performance. 


A Brief Review of the Root 


We will now briefly review the organization and location of various files and their responsibilities in the 
UNIX tree-structured filesystem. This is in no way intended to replace the more authoritative descriptions 
of the UNIX file tree (see The Design and Implementation of the 4.3BSD UNIX Operating System, by 
Leffler, et al., Addison-Wesley, 1989 for more information on this topic) but will outline what needs to be 
present in the root for minimal operation with our 386BSD system. 

In Example 1 , the root directory, we can see the base of the root filesystem containing all of the top-level 
directories and files in our 386BSD system. This listing, generated by the UNIX ls command (ls -l), shows
file attributes, link count, ownership, file size, modification date, and filename. Three kinds of files are
present here (as indicated by the first character of attributes): directories (d), symbolic links (l), and
regular files (-). Files in the root serve the functions of installation, booting, system initialization, device 
configuration, basic utilities, system operations, and so on. 

Example 1: The root directory listing generated by the UNIX ls command (ls -l), showing file
attributes, link count, ownership, file size, modification date, and filename.

drwxr-xr-x  2 root    1536 Feb  3 10:18 bin/
-rwxr-xr-x  1 root   20480 Sep  4 21:02 boot*
drwxr-xr-x  2 root    1024 Feb 22 13:32 dev/
drwxr-xr-x  2 root    1536 Mar  5 18:31 etc/
drwxr-xr-x  2 root     512 Dec  7 12:41 lib/
drwxr-xr-x  2 root    4096 Dec  7 12:41 lost+found/
drwxr-xr-x  2 root     512 Aug 16  1990 mnt/
drwxr-xr-x  2 root     512 Dec  6 12:11 root/
drwxr-xr-x  2 root    1024 Dec  8 12:45 sbin/
drwxr-xr-x  2 root     512 Sep 19 09:18 stand/
lrwxr-xr-x  1 root      12 Jun  4  1990 sys@ -> /usr/src/sys
drwxrwxrwx  2 root     512 Mar  5 18:31 tmp/
drwxr-xr-x  2 root     512 Jan 26 22:12 usr/
drwxr-xr-x  2 root     512 Jan 27 23:12 var/
-rwxr-xr-x  1 root  319488 Feb 22 08:57 vmunix*
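
The way the first attribute character identifies the kind of file can be shown with a short C sketch. The function name entry_kind is hypothetical, and only the three kinds named in the text are recognized; anything else (device files, for instance) is reported as "other".

```c
#include <assert.h>
#include <stddef.h>

/* Classify a long-listing entry by the first character of its attributes.
 * Only the three kinds discussed in the text are distinguished. */
const char *entry_kind(const char *attributes)
{
    assert(attributes != NULL);

    switch (attributes[0]) {
    case 'd': return "directory";
    case 'l': return "symbolic link";
    case '-': return "regular file";
    default:  return "other";
    }
}
```

Applied to Example 1, entry_kind("drwxr-xr-x") describes bin/, entry_kind("lrwxr-xr-x") describes sys, and entry_kind("-rwxr-xr-x") describes boot and vmunix.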


Installation: /stand 

The /stand directory in our BSD root filesystem contains standalone programs to be loaded using the 
standalone /boot program and run directly on the machine, sans the presence (and possible interference) of 
the operating system. This permits us to run programs to test, format, or diagnose device behavior. Other 
programs (/stand/cat, /stand/ls, /stand/icheck) allow us to diagnose problems with the root filesystem, 
independent of the operating system. In addition, standalone disk bootstrap programs (/stand/bootwd, 
/stand/bootfd) reside here, to be installed by other programs (/sbin/disklabel) onto the disk media. 

Booting: /boot and /vmunix 

Two other standalone programs are worthy of mention here: /boot, the universal bootstrap used to load the 
system off any media, and /vmunix, the operating system kernel proper. According to our early porting 
plan, we actually use these files last, because we load our system off of MS-DOS instead of from the BSD 
root filesystem. For those unfamiliar with UNIX, the only use of the operating system’s executable file 
after bootup is to provide information on symbolic references inside of the kernel, which are of use to a very
few nonessential programs. In other words, although our system would continue to run if this file were
overwritten or deleted, it probably would not boot in these cases. 

Initialization: /sbin/init, /dev/console, and /bin/sh 

When the system starts operation, it first executes the program in the file /sbin/init, which initializes the
system and prepares it for operation. The system, as it starts, is completely mute otherwise. In the minimal
case, the system is started "single-user": init configures the system to execute commands from
the console device (on a PC, the keyboard and display). This resembles the command mode that MS-DOS
systems provide on a standard boot-to-command interpreter. In this case, init opens the console
(/dev/console) and executes the command interpreter or shell (/bin/sh). Thus, the minimum files we need
(in addition to the boot files mentioned earlier) are /sbin/init, /dev/console, and /bin/sh. If any of these are
lacking or damaged, UNIX cannot run, and we will not get a prompt from the command interpreter. Of
course, in order to do something useful, we’ll also need the files that correspond to commands presently
used to run the aforementioned commands from the interpreter. This is the minimum required to get our 
kernel running. 

Although many PCs frequently run UNIX with a sole user, /sbin/init can also prepare the system for 
multiuser or multitasking operation. In this case, init runs the command interpreter on a file of commands 
(/etc/rc) commonly referred to as a "shell script." This in turn calls other shell scripts for network, device, 
and server process invocation. 

Server processes, which provide a variety of services available with UNIX systems, are often referred
to as "daemons," as they attempt to do work invisibly. This is a play on Maxwell’s daemon, who would 
merrily put hot molecules in one box and cool molecules in another box, thus (?) violating the Second Law 
of Thermodynamics. (The proof fails when the daemon acquires so much energy in rapid collisions from 
highly vibrating molecules that it must radiate the energy as heat, thus perturbing the system. See 
Feynman's Lectures on Physics, Volume 1, for more information. You just can’t get something for 
nothing, can you?) 

Among other things, the system may now perform housekeeping functions: fixing any broken filesystems 
it can, erasing temporary files and other garbage, adding filesystems (both on the computer and over the 
network), and connecting the system into the world. 

Traditionally, all these commands provide little output as they are launched, which can occasionally
confuse more than reassure. One popular computer author, unfamiliar with UNIX, complained of feeling 
quite uncomfortable when a UNIX workstation flashed him the message, "starting standard daemons." 
Perhaps he thought he needed the help of a system exorcist! 

In multiuser operation, the system depends on all the functionality that has been configured, including 
indications of service availability through the appearance of a login prompt. We find it amusing when our 
386 PC laptop prompts us for a login account name and password, as if we are competing with hundreds 
of users for access to the machine! On the other hand, our little 386 laptop, running 386BSD, has about as 
much disk space and memory and is three times the speed of the PDP 11/70 that the University of 
California used to run 50 to 70 students at a clip. UNIX regards little PCs and systems with hundreds of 
terminals in the same way: a login prompt per customer.

Configuration: /dev and /etc

Hardware devices on UNIX are accessed through special filenames in /dev. For our filesystem to work 
correctly, we must have the appropriate device files already made. Otherwise, the utility programs will not 
be able to access the devices, even if the operating system has drivers that work with the underlying 
hardware. These files are special because they are made with a special program (/sbin/mknod) which 
creates an association between the file and a software driver in the kernel. A shorthand script program 
(/dev/MAKEDEV) provides a way to make these files symbolically. With 386BSD, we must make the
console (/dev/console) and the root filesystem’s device (/dev/wd0a) before we run; it’s wise to make other
special files at this time, too. 
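
The association between a special file and a kernel driver is traditionally carried by a device number split into a "major" part (which driver) and a "minor" part (which unit or partition). The following C sketch assumes the classic layout of an 8-bit major and 8-bit minor packed into 16 bits; the article does not state 386BSD's actual layout, and the sk_ names are hypothetical stand-ins for the system's own macros.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t sk_dev_t;      /* hypothetical 16-bit device number */

/* Pack a driver index (major) and unit index (minor) into one number. */
sk_dev_t sk_makedev(unsigned major_no, unsigned minor_no)
{
    assert(major_no < 256 && minor_no < 256);
    return (sk_dev_t)((major_no << 8) | minor_no);
}

/* Recover the driver index: selects the driver within the kernel. */
unsigned sk_major(sk_dev_t dev) { return (dev >> 8) & 0xffu; }

/* Recover the unit index: interpreted by that driver alone. */
unsigned sk_minor(sk_dev_t dev) { return dev & 0xffu; }
```

Under this assumed layout, a program like /sbin/mknod records such a number in the special file, and the kernel uses the major part to find the driver whenever the file is opened.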

Besides configuring device filenames, we need to specify device configuration for disk drives (/etc/fstab), 
terminal lines (/etc/ttys), and printers (/etc/printcap) to describe device characteristics. One criticism of all 
UNIX systems has been the need to wade through a plethora of ad hoc configuration files for device and 
program use. Most of the system configuration files in this project, however, can be found within the /etc 
directory. 

Utilities: /bin and /sbin 

The basic utilities needed for operation of UNIX are found in the two directories: /bin and /sbin. /sbin 
contains supervisory commands not generally useful to ordinary users, but important for system operation 
and system management. /bin contains basic commands useful to all UNIX users, a kind of core group.
Both of these directories are kept short and small to minimize the size of the root and the time it takes to
search for a command. All other commands (hundreds, usually) are found in the additional filesystems that
become active when UNIX is brought up multiuser. To this end, it is important to note that /sbin/mount 
and /sbin/umount are used to mount and unmount those additional filesystems. 

Operation: /tmp and /var 

Once in operation, the /tmp directory is used to store temporary files from editors, formatters, compilers, 
and assemblers, as needed. /var is a directory that holds various short-term data, such as usage accounting,
security logs, incoming electronic mail, crash dumps, printer spooling, and runtime program databases. 
Frequently, these two directories are separately mounted filesystems, especially on systems where these 
kinds of files take up much space. 

Other Directories: /lib, /mnt, /usr, /root, and /sys 

Finally, we have a group of files that don’t fit any of the above categories. /lib contains object libraries and
runtime start-off routines to allow C and other languages to run on the system. /usr is an empty directory
used as a mount point for a much larger filesystem, one that contains everything else not on the
root in the way of utilities, object libraries, include files, documentation, and system source. /mnt, also an
empty directory, is used as a mount point for optional filesystems to be attached when needed. /root
contains the home directory for the superuser account (root), keeping it separate from the actual root
directory of the system. /sys is our sole example here of a symbolic link, a file type that provides a
shortcut within the filesystem to another location in the filesystem tree. In this case, /sys hides a reference
to /usr/src/sys, so when the filesystem associated with /usr is mounted, a reference to a file like
"/sys/i386/i386/locore.s" is satisfied with a reference to the file "/usr/src/sys/i386/i386/locore.s".
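
The effect of the /sys link on a pathname can be sketched as a simple prefix substitution. This is a simplification for illustration only: a real kernel expands links component by component during name translation, and the function name follow_link is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rewrite `path` into `out` if it begins with the link name `link` on a
 * component boundary; otherwise copy it unchanged.
 * Returns 0 on success, -1 if the result does not fit in `out`. */
int follow_link(const char *path, const char *link, const char *target,
                char *out, size_t outsize)
{
    size_t n = strlen(link);
    int written;

    assert(outsize > 0);

    if (strncmp(path, link, n) == 0 && (path[n] == '/' || path[n] == '\0'))
        written = snprintf(out, outsize, "%s%s", target, path + n);
    else
        written = snprintf(out, outsize, "%s", path);

    return (written >= 0 && (size_t)written < outsize) ? 0 : -1;
}
```

Called with the path "/sys/i386/i386/locore.s", the link "/sys", and the target "/usr/src/sys", the sketch produces "/usr/src/sys/i386/i386/locore.s"; a path such as "/system" is left alone because the match does not end on a component boundary.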


Filesystem Creation 

Normally, we would use our ported system to create root filesystems, but we again run into a 
"chicken-and-egg" problem, because we need a finished system to create the archetypal root filesystem 
from which we make all others. So, in the typical "break the egg" and "cook the chicken" way we resolve 
all minor paradoxes, we make the first filesystem on our cross-host by special means. We either find a 
cross-host with identical key data structure characteristics (byte order, structure field alignment, and 
structure packing) or write a transformation program to turn our cross-host’s filesystem format (via
stretching, swapping, and shrinking) into a 386BSD-compatible form. The result is a file of bytes that 
contains an image of what the filesystem should contain on the PC’s disk drive. 
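
The "swapping" part of such a transformation program might look like the following C sketch, which reverses the byte order of 16-bit and 32-bit fields. It assumes a cross-host whose byte order is the opposite of the 386's; the function names are hypothetical, and a real converter would also have to know which bytes of each on-disk structure are integer fields and which are not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the two bytes of a 16-bit field. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Reverse the four bytes of a 32-bit field. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
           ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Convert an array of 32-bit fields in place, as one pass of a
 * hypothetical image-transformation program might do. */
void swap32_fields(uint32_t *fields, size_t count)
{
    assert(fields != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        fields[i] = swap32(fields[i]);
}
```

Character data such as filenames must not be passed through these routines, which is one reason the conversion has to be driven by the layout of each structure rather than applied blindly to the whole image.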

If we were starting this project now, we might consider a novel alternative method using the BSD NFS
(Network FileSystem) code. We would then run our 386BSD kernel in a "diskless" fashion, passing all file 
operations over the network to be satisfied by an NFS server host. We could use any NFS server to 
provide access to our initial root filesystem. Oddly enough, this would hide not only the cross-host’s 
filesystem format, but the cross-host’s operating system as well. Conceivably, one could even use a
non-UNIX cross-host. All of this is made possible by NFS’s file abstraction mechanism, which converts
filesystem data to a common external representation via its internal XDR (External Data Representation)
library. 
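
The idea of a common external representation can be illustrated with the simplest case: XDR carries a 32-bit integer as four bytes, most significant byte first, regardless of the byte order of either machine. The C sketch below shows that encoding and its inverse; the function names are hypothetical and are not the XDR library's own entry points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a 32-bit value as four bytes, most significant byte first. */
void sk_xdr_put_u32(unsigned char out[4], uint32_t v)
{
    assert(out != NULL);
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Decode four bytes in external form back into a host integer. */
uint32_t sk_xdr_get_u32(const unsigned char in[4])
{
    assert(in != NULL);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because both routines work with shifts rather than by reinterpreting memory, the same source gives the same bytes on the cross-host and on the 386, which is what lets the two sides ignore each other's internal formats.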

Filesystem Downloading 

With our filesystem image in a file, we can download it using either Kermit, NCSA Telnet, or some other 
file transfer utility that can copy a binary image from our cross-host to the PC under MS-DOS. In the early 
stages, before the kernel successfully ran processes, small filesystems of a few hundred Kbytes (principally
/dev/console, /sbin/init, /bin/sh, and /bin/ls) could be downloaded as needed over the serial ports using
Kermit. As success with the kernel increased, so did the size of the root filesystem, because the focus of
the project moved from minimal operation to proving the kernel by means of increasingly larger utilities.
This affected us in three ways: serial link downloading took too long; our MS-DOS partition limited the
size of the filesystem that we could copy across; and even a single-byte change in a single file required a
complete filesystem download to effect the modification.

Having downloaded the filesystem, we used the copyfs program (see "Initial Utilities: Three PC Utilities" 
in DDJ, February 1991) to install it in a partition on the hard disk, separate from MS-DOS. The BSD 
kernel disk driver was also modified to relocate what it considered the beginning of the disk to this point 
so we could share the disk with two systems. Copyfs would place the image of the filesystem onto the 
absolute disk storage blocks without any translation, making the image real. 
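
The relocation performed by the modified disk driver amounts to adding a fixed offset to every block number and refusing requests that fall outside the BSD partition. A minimal C sketch follows; the function name, parameter names, and the sector counts in the usage note are hypothetical, since the article gives no actual partition geometry.

```c
#include <assert.h>
#include <stdint.h>

/* Translate a sector number relative to the start of the BSD partition
 * into an absolute sector number on the disk.
 * Returns -1 if the request lies beyond the end of the partition. */
int64_t absolute_sector(uint32_t partition_start, uint32_t partition_size,
                        uint32_t relative)
{
    /* The partition itself must fit within 32-bit sector addressing. */
    assert(partition_start <= UINT32_MAX - partition_size);

    if (relative >= partition_size)
        return -1;
    return (int64_t)partition_start + relative;
}
```

With a BSD partition assumed to begin at sector 20000 and to span 1000 sectors, relative sector 0 maps to absolute sector 20000, and any request at or past relative sector 1000 is rejected rather than allowed to run into the neighboring MS-DOS area.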

Filesystem Debugging 

At this stage, it is considered good practice to check the filesystem on the PC. We used the standalone 
utilities (/boot, /stand/cat, /stand/ls, /stand/icheck) to verify that the filesystem was correct for use with the 
kernel. However, even before having an operational system, we can validate our filesystem with our 
standalone system (see "Initial Utilities: The Standalone System," DDJ, March 1991), because it has the 
ability to interpret the filesystem data structures. /boot can be used to check for the presence of files and
directories by attempting to boot from a file. For example, one can try to boot from /stand/ls, with the
proviso that "/" be a directory that has the "stand" directory in it, and that "stand", in turn, contain "ls", an
executable file. If the given file cannot be opened, /boot will tell us why. ls, like its user-mode utility
counterpart, lists the contents of a directory on a disk, so we can check to see if the contents are correct.
Similarly, cat shows the contents of an ASCII file, so we can check to see that the ASCII files present
have the appropriate contents and that fence-post or data translation problems have not corrupted the files.
Finally, /stand/icheck, the largest standalone program, can exhaustively check for filesystem consistency
to make certain that all of the filesystem’s data structures are undamaged. We can verify this by running
the same icheck program on the cross-host, ensuring that the filesystem is identically consistent on both 
the cross-host and the target system. 

These validation techniques independently test file contents separate from filesystem data structures, or
"meta data," on the off-chance that we are somehow corrupting the contents of files when we create the 
filesystem. It’s important to realize that programs that check the filesystem have no way to check contents 
of files. Thus, the file contents may be completely mangled in ways that could still leave the filesystem in 
a correct state! 
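
Because a consistency checker says nothing about file contents, one way to compare a file on the cross-host with its copy on the target is to compute a checksum on both sides. The article itself relies on inspecting files with cat; the sketch below is an added illustration that uses the well-known Fletcher-16 checksum, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16 checksum over a buffer of bytes.  Two running sums are
 * kept modulo 255; the second makes the result sensitive to byte order,
 * so transposed or shifted data is likely to be detected. */
uint16_t fletcher16(const unsigned char *data, size_t len)
{
    uint32_t sum1 = 0, sum2 = 0;

    assert(data != NULL || len == 0);

    for (size_t i = 0; i < len; i++) {
        sum1 = (sum1 + data[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t)((sum2 << 8) | sum1);
}
```

If the value computed for a file on the target differs from the value computed for the same file on the cross-host, the contents were altered somewhere between filesystem creation and installation, even when the filesystem structures check out as consistent.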

What’s in a Filesystem? 

As we stated earlier in this article, a filesystem is a data structure designed to implement the abstraction of 
files and directories. As such, there are dozens of types of filesystems possible. Berkeley UNIX currently 
offers three flavors of filesystems: UFS, NFS, and MFS. 

• UFS, like many other filesystems, manages to impress its underlying files and directories on a bulk
storage medium such as magnetic moving-head disks. In particular, UFS uses placement algorithms to
schedule head movement and rotational delay to improve average filesystem effectiveness. 

• NFS, the Network Filesystem originally designed by Sun Microsystems, funnels program requests for 
files over a network connection, which is then satisfied by a server machine’s own filesystems. 
Consequently, these files can be located quite a distance away from the actual computer whose 
program is referencing a file. 

• MFS, a memory-based filesystem, stores temporary files in the processor’s virtual memory storage 
areas for rapid access to transient data. It evolved from RAM-based disks used on many MS-DOS 
systems and uses virtual memory to provide a way to keep active files present in RAM while 
gradually moving inactive portions back to the disk. 

Why Do We Need a Root Filesystem? 

Traditionally, the UNIX filesystem is used to hold the operating system and its bootstrap as ordinary files. 
This makes it convenient to create and install new versions of the operating system with the very same 
tools used to develop ordinary user programs. This arrangement also makes it possible to choose 
alternative versions of the operating system, and to run newer systems under development, or fall back to 
back-up versions if for some reason the default system is damaged and unusable. This flexibility presents a 
problem—how do you load the operating system which makes use of the filesystem if it’s already in the 
filesystem itself? 

As part of the bootstrap process, the computer loads bootstrap programs with an ever-increasing ability to 
manipulate the hardware and access files from the UNIX filesystem. In 386BSD, the ROM BIOS starts the 
process by reading the first block of disk storage off the disk, and then executes its contents as an ordinary 
program. This tiny program has the sole responsibility of reading in a program 15 times its size and
located on the next successive blocks on the disk drive. In turn, this larger program has the responsibility
of deciphering the UNIX filesystem located adjacent to it on this disk drive, and extracting the next 
bootstrap program from the file "/boot" in the filesystem. This final bootstrap program can be arbitrarily 
large (bounded by physical memory) and can load programs from all possible devices on the computer. 
This bootstrap can also determine which device to load the operating system from, the configuration of the 
processor prior to boot, and power-fail or crash-recovery steps. It can also decide whether the system 
should automatically reboot itself or pause and await manual intervention to remove an obstacle inhibiting 
automatic reboot; it can be interrupted by an operator if he wishes to change his mind and insist on 
alternative actions. Thus, the bootstrap can be used to load other standalone programs that might be used 
for disk formatting, recovery, or installation, as well as loading the operating system itself (the file 
/vmunix). In a sense, when the bootstrap is loading, you might call the filesystem it is using the "boot 
filesystem"! 
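
The sizes of the first two bootstrap stages follow from simple arithmetic. The article does not state the size of the first disk block; the sketch below takes it as a parameter, and the 512-byte figure in the usage note is an assumption. The function names are hypothetical.

```c
#include <assert.h>

/* Size in bytes of the second-stage bootstrap, which the text describes
 * as 15 times the size of the first block. */
unsigned long second_stage_bytes(unsigned long block_size)
{
    assert(block_size > 0);
    return 15UL * block_size;
}

/* Total disk space taken by the first two stages, which sit in
 * successive blocks at the start of the disk. */
unsigned long boot_area_bytes(unsigned long block_size)
{
    return block_size + second_stage_bytes(block_size);
}
```

Assuming a 512-byte block, the second stage occupies 7680 bytes and the two stages together occupy 8192 bytes; only the third stage, the file /boot, is free of such a fixed limit.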

A similar chicken-and-egg problem occurs when we decide to run the initialization process (/sbin/init) to 
initialize the subsequent user program operation of the system. Because UNIX systems only know how to 
execute a program from a file in a filesystem, we need a filesystem from which to execute files. Thus, the 
root filesystem is the first filesystem accessible, via a kind of "virgin birth." All other filesystems will be 
explicitly attached to it via the UNIX mount command, which tapes the base (or root) of the filesystem to 
be mounted onto an existing directory in the root (the "mount" point). 
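
The way mounted filesystems divide up the single file tree can be sketched as a search for the mount point that is the longest prefix of a pathname. This is a simplification: a real kernel crosses mount points while translating a name one component at a time. The structure, the function name, and the device /dev/wd0h used in the usage note are hypothetical; mount points are assumed to be written without a trailing slash, except for "/" itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sk_mount {
    const char *point;      /* directory where the filesystem is attached */
    const char *device;     /* special file holding the filesystem */
};

/* Return the table entry whose mount point is the longest prefix of
 * `path` on a component boundary, or NULL if the table is empty. */
const struct sk_mount *find_mount(const struct sk_mount *table, size_t count,
                                  const char *path)
{
    const struct sk_mount *best = NULL;
    size_t best_len = 0;

    assert(path != NULL && path[0] == '/');

    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(table[i].point);
        int is_root = (n == 1);             /* "/" covers every path */

        if (strncmp(path, table[i].point, n) != 0)
            continue;
        if (!is_root && path[n] != '/' && path[n] != '\0')
            continue;                       /* "/usr" must not match "/usrlocal" */
        if (best == NULL || n > best_len) {
            best = &table[i];
            best_len = n;
        }
    }
    return best;
}
```

With a table holding the root on /dev/wd0a and a second filesystem mounted on /usr from /dev/wd0h, the path /usr/src/sys resolves to the second filesystem, while /bin/sh falls through to the root, which is why the root must be present before anything else can be attached.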

Non-UNIX systems have an entirely different perspective regarding bootstrapping. Usually, the given 
system is kept on a special, dedicated location on the disk, frequently adjacent to bootstrap code. 
Sometimes, the equivalent of the UNIX /sbin/init program is also found in this special location. Therefore, 
these programs require special installation onto the disk, and the system does not require the concept of a 
"root" filesystem, because it does not require a filesystem to become active. 

Note also that we have one file-naming convention in UNIX, so that even devices are named just like 
ordinary UNIX files (/dev/wd0a or /dev/console, for example). This is different from MS-DOS or VMS,
where two namespaces are present at any time: the device namespace (A: or DKOA:) and the file 
pathname (\foo\bar\bletch or [foo]bar.bletch;2). With UNIX, the filesystem is a central concept, along with
the global way in which it is used and reused to provide a sole namespace for file objects. In a sense, the 
originators of UNIX felt this concept to be so important that in follow-on work (such as Plan 9; see DDJ,
January 1991), the filesystem is even more central to the system, by becoming a way of expressing 
interprocessor, window system, and program communications metaphors. 

The Filesystem Metaphor and its Importance in Future Work 

With all modern systems, we now use the filesystem metaphor underlying the basic syntax and semantics 
of the UNIX filesystem. As a result, the same file specification syntax known to all UNIX applications 
programs can be used to transparently access files embedded in archival storage systems, remotely 
manipulate files on remote systems of entirely heterogeneous design, store files on fail-safe redundant 
media, or a combination of these. We could even design a database filesystem where the filename 
directory path would describe a database query, with the "leaf" files themselves being the database 
records. The foresight of the originators of the early hierarchical filesystems (and the Multics Project) is 
now apparent, as these ideas come to fruition in a variety of research and commercial applications. As we 
continue to struggle with the complexity of our software systems, the use of powerful metaphors that unify 
many mechanisms within one becomes increasingly critical to the design and implementation of any 
complex system. 


Porting Unix To The 386: Research & The Commercial Sector 

A discussion of the various demands placed on research and commercial operating systems, and how they 
differ. 

Where does BSD fit in? 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, which was the first virtual memory microprocessor-based Unix system. 
Prior to establishing TeleMuse, a market research firm, Lynne was vice president of marketing at 
Symmetric Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions 
or comments to lynne@berkeley.edu. Copyright (c)1991 TeleMuse. 

"The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing wax— Of 
cabbages—and kings—And why the sea is boiling hot— And whether pigs have wings." —Lewis Carroll 

At this point in our article series, the basic toolset for 386BSD development is in place, and we’re ready to 
begin the job of porting the kernel program. (Or to use the mountain-climbing analogy we’ve followed 
until now, we’ve completed the preliminaries and are ready to begin scaling the peak.) With this in mind, 
it’s a good time to pause and consider where we’ve been and where we are going. 

We’ve discovered over the course of this series that there is considerable confusion and debate among 
researchers, programmers, businesses, and other interests over the nature and role played in the computer 
industry by Berkeley UNIX in general and by 386BSD in particular. This is not surprising, given the 
direction of operating systems in the commercial sector such as AT&T’s System V, Release 4, Apple’s 
System 7, IBM/Microsoft OS/2, and others. As such, it has become crucial to differentiate these two 
sectors, examine the differing motivations and goals, and discuss some of the trends that will eventually 
tie these two worlds together. 

The most important thing to remember about Berkeley UNIX is that it is and will remain a "research" 
project. This means it is not designed with the needs of the commercial sector in mind — the University of 
California is not a development shop such as the SCOs of the world. BSD provides, in essence, the 
opportunity for operating systems, applications, networks, and other areas to evolve beyond the current 
requirements of the commercial sector to produce the technology required for next stage efforts. This is a 
demand of research — to get on with new work and not simply stagnate. 

Commercial operating systems releases have a far different agenda, however. While much of it is 
self-serving (such as ABI, which we think should actually stand for "AT&T Binary Intolerance"), there is 
a method to the madness. Commercial releases are tied to the past. In fact, the tie is so strong that even 
when there is a critical need to offload some past burdens, a company finds it politically impossible to do 
so. We are reminded of Fred Brooks’s classic work, The Mythical Man-Month (Addison-Wesley, 1975), 
and his discussion of the infamous (but popular) IBM OS/360 operating system. This operating system 
grew bigger and bigger and bigger in order to meet the perceived demands of its customers. And as it 
grew bigger, the number of bugs grew as well (though not at the same rate). As Brooks reflects on this 
project, he pinpoints a key issue (pages 122-123): 


Lehman and Belady have studied the history of successive releases in a large operating system. They find 
that the total number of modules increases linearly with release number, but that the number of modules 
affected increases exponentially with release number. All repairs tend to destroy the structure, to increase 
the entropy and disorder of the system. Less and less effort is spent on fixing original design flaws; more 
and more is spent on fixing flaws introduced by earlier fixes. As time passes, the system becomes less and 
less well-ordered. Sooner or later the fixing ceases to gain any ground. Each forward step is matched by a 
backward one. Although in principle usable forever, the system has worn out as a base for progress. 
Furthermore, machines change, configurations change, and user requirements change, so the system is not 
in fact usable forever. A brand-new, from-the-ground-up redesign is necessary. 

In sum, it might have been simpler to abandon further work on this titanic system (400K of assembler 
code, a princely sum at the time) and go on to new operating systems. 

Because of its research agenda, Berkeley UNIX is less concerned with issues such as ABI. Applications 
interfaces are quite properly handled outside the kernel, usually with a library. Eventually, antiquated or 
nonstandard interfaces are brought up to speed with newer technology, and programmers use the library 
less and less, until finally most delete it from their world. Researchers cannot afford to work with bloated 
kernels, stuffed full of arcane and inappropriate software. Research operating systems must be lean, mean 
computing machines. 

So, while everyone longs for the latest innovation, the BSD Maserati, not everyone feels comfortable with 
the incredible power it provides — not everyone is a race car driver any more than everyone is an operating 
systems programmer. BSD provides the mechanism for tremendous new opportunities, but it doesn't have 
a lot of safety nets. You can just as easily crash and burn with BSD, and have no one to blame but 
yourself. 

Commercial systems vendors offer customers a nice, big, memory-guzzling Oldsmobile of an operating 
system ("This is your father’s operating system") with all the same features everyone has seen since 
childhood. And when in doubt, more is added in the kernel until everyone is satisfied. Because they do not 
want to increase support overhead, vendors try to prevent "crash and burn" occurrences by making it so 
safe that you are protected from yourself (and, by the way, if you do circumvent the controls, you can’t 
blame them—they tried to save you). At least, this is the intent: Give the customers what they want (within 
reason and while maintaining control) and try to minimize support headaches. 

We said in January that "because standards by accumulation just don’t work, we strive in 386BSD to 
avoid such nonsense." This was not an idle statement, but a cornerstone of our specification. We are not 
the first to observe the problems that arise when bloated kernels become a mainstay of either research or 
commercial offerings (as Fred Brooks discussed in his book). In several current commercial offerings, the 
complexity has become so great that the kernels have become difficult to maintain and impossible to 
orient toward the future. Customers lose what they so desperately desire in an expensive commercial 
release, namely bug-free software and timely support. This trade-off for flexible and innovative systems is 
beginning, like OS/360, to sink under its own weight. 


It’s ironic that this would happen to UNIX, which was predicated on the essentially "minimalist" work of 
Thompson and Ritchie. This problem is not restricted to the commercial sector, however. In fact, one 
reason many are less than enamored with MACH is that its "microkernel" is roughly comparable in size to 
the 386BSD kernel — and yet it requires much more memory to be useful. 

The final question now becomes, "What do I use now?" For the user dependent on a proprietary database 
or accounting software, the answer is quite simple — just continue using what you have been using. Unless 
there is a compelling reason to invest in new technology, you just disrupt your business and your workers 
for no good reason. Eventually, some aspects of Berkeley UNIX will be integrated into commercial (and 
other research) releases, but don’t anticipate it soon. 

For those who must look to the future, however, such as applications, networking, and operating systems 
designers, Berkeley UNIX will continue to be a source of the innovative new technology required for new 
products and new functionality in a competitive world economy. Businesses and programmers should 
keep themselves current with these research trends, for ready incorporation in the commercial market. By 
the way, sometimes it’s nice to drive a Maserati. 

The 386BSD Project and Berkeley UNIX 

The 386BSD project was established in the summer of 1989 for the specific purpose of porting the 
University of California’s Berkeley Software Distribution (BSD) to the Intel 80386 microprocessor 
platform. 

Encompassing over 150 Mbytes of operating systems, networking and applications software, BSD is a 
fully functional complete operating systems software distribution. The goal of this project was to make 
this cutting-edge research version of UNIX widely available to small research and commercial efforts on 
an inexpensive PC platform. By providing the base 386BSD port to Berkeley, our hope is to foster new 
interest in Berkeley UNIX technology and to speed its acceptance and use worldwide. We hope to see 
those interested in this technology build upon it in both commercial and noncommercial ventures. 

In each of these articles we will examine the key aspects of software, strategy, and experience that make 
up a project of this magnitude. We intend to explore the process of the 386BSD port, while learning to 
effectively exploit features of the 386 architecture for use with an advanced operating system. We also 
intend to outline some of the trade-offs in implementation goals, which must be periodically re-examined. 
Finally, we will highlight extensions that remain for future work, perhaps to be done by some of you 
reading this article today. 

Currently, 386BSD runs on 386 PC platforms and supports the following: 

• Many different PC platforms, including the Compaq 386/20, Compaq Systempro 386, and 386 with 
the Chips and Technologies chipset, any 486 with the OPTI chipset, Toshiba 3100SX, and more 

• ESDI, IDE and ST-506 drives 

• 3-1/2-inch and 5-1/4-inch floppy drives 

• Cartridge tape drive 

• Novell NE2000 and Western Digital Ethernet controller boards 

• EGA, VGA, CGA, and MDA monitors 

• 287/387 floating point including the Cyrix EMC 


• A single-floppy standalone UNIX system, containing support for modems, Ethernet, SLIP, and 
Kermit to facilitate downloading of 386BSD to any PC over the Internet. 


—B.J. and L.J. 

Porting Unix To The 386: A Stripped-down Kernel 

The 386BSD basic kernel, incorporating a unique ’recursive’ paging feature that leverages resources and 
reduces complexity. 

Onto the initial utilities 

This article contains the following executables: 386BSD.791 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, the first virtual memory microprocessor-based UNIX system. Prior to 
establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric 
Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or 
comments to lynne@berkeley.edu. Copyright (c) 1991 TeleMuse. 

Much has been made of the preparations we have required before we could embark on our present project. 
While that’s all well and good, at some point we really would like to get on with our adventure and start 
the main assault — the kernel itself. Our roundabout development of tools and equipment allowed us to 
scope out the weak points in the 386BSD specification, with the added bonus of enhancing our experience 
and confidence. By following a disciplined set of guidelines and procedures, we minimized one of the 
most demoralizing activities of all — trying to build our system without any idea as to where the bugs (or 
failure modes) lie, especially those enormously irritating compiler bugs induced by driver implementation 
bugs. 

Now we arrive at the point at which we would like to create a "stripped-down" kernel. At this stage of our 
work, our primary concern is with the machine-dependent portions of the kernel that install it into the 
position to execute processes (via the bootstrap procedure) and prepare the system for initialization of the 
minimum machine-independent portions of the kernel (processes, files, and pertinent tables). 

Our 386BSD kernel is a kind of "virtual machine" (not to be confused with the "virtual" in "virtual 
memory"), where functions underlie other functions transparently. When the system is initialized, it can 
use portions that require little direction to initialize even larger portions. Thus, this virtual machine 
assembles itself tool by tool, much like a set of Russian dolls. The machine-dependent kernel initialization 
is the innermost of the dolls — the kernel of the kernel around which all is built. The next outer layer will 
then be built by the kernel’s main() procedure (to be discussed later), which in turn initializes higher-level 
portions of the kernel. 

While our basic approach toward "wiring" the 386 for operation with the machine-independent BSD 
kernel is similar to that of our standalone system (see DDJ March 1991), the details are now very 
important. In fact, we’ve changed so much since our discussion of the 386BSD specification (DDJ January 
1991) that even the specification needs to be revised in several key areas such as the virtual memory 
system and per-process data structure. In addition, the most recent versions of 386BSD (less than a month 
old) exploit a unique feature of the 386 architecture in the form of "recursive" paging, which not only 
leverages resources to the hilt, but also reduces complexity enormously. (See text box "Brief Notes: 
386BSD Recursive Paging.") 

The Basic Structure of the UNIX Kernel 

The structure of the BSD UNIX system is akin to that of an onion. Consisting of layer upon layer, the 
outside layers of the BSD onion are those processes visible to the computer "user," while the inner layers 
hide processes the user needn’t see, such as those relating to the hardware. (This can also be called the 
"Almond Roca" kernel, if you prefer sweets.) 

The operating system kernel lies in the innermost layer. Its primary responsibility is to provide the 
appropriate level of utility services upon which other programs and facilities are built. The kernel itself 
consists of an inner "machine-dependent" portion and an outer "machine-independent" portion. The center 
of the onion could be considered the raw hardware itself. 

In UNIX parlance, the kernel is typically divided into the "high kernel" and the "low kernel." The high 
kernel is concerned with UNIX abstractions, such as files, processes, and other related objects. The low 
kernel, in contrast, is concerned with the functionality of the kernel — how to implement the abstractions 
with machine-dependent mechanisms for operation. 




To some degree, all operating systems are designed with this basic "onion" model in mind. However, the 
designers of competing systems spend a great deal of time determining what items belong in a given layer. 
Unlike the ISO OSI layer model for computer systems networking, no agreement yet exists 
on the ideal model for operating systems design. 

Many operating systems prior to UNIX did not precisely delineate the boundary between the operating system and user 
programs, which resulted in quite a wide variation in layering. Some operating systems (such as VMS, RSX, 
and OS/370) have thousands of different entry points and functions — many chosen on an ad hoc basis. For 
example, some user programs would call directly into the operating system at a point known to be past a 
register-save sequence, because the writer of the program would assume that it didn’t cause a problem and 
might even speed up the program slightly. Even nonuniformity within operating systems can occur, such 
as when a devotee of one particular system adds a facility which relies on a system call differing radically 
from the rest of the system. In these cases, the layering is blurred between the user application program 
and the given operating system — not surprising considering the various ways that the same effect can be 
achieved. 

For UNIX, a fundamentalist "return to the basics" approach was a philosophical as well as a design issue. 
Unlike the other systems mentioned, UNIX has a very small number of system calls (typically fewer 
than a couple hundred), and, as such, must leverage them for maximal operation. This "simplicity" of 
design can be found throughout its structure. In fact, a suspect subsystem within UNIX itself is often 
branded as "unlike UNIX" due to nonmodular or clumsy design. Ironically, this has been the case with 
software that has been part of UNIX for years and widely used. 


Part of the reason UNIX adherents (and its designers) appear to be "zealots" of the minimalist view is that 
the pressure to add "just one more" system call is quite great, and this one area alone has become a point 
of highly charged and subjective debate as to where to draw the line. This is one reason why a single 
UNIX "standard" has yet to emerge — the lack of consensus on this and other crucial issues. 

Incremental Strategy 

Despite the "purity of essence" debates, UNIX has grown like a weed. (Any undesired plant is a weed, and 
one could say the same about UNIX, at least initially; just ask DEC or IBM or Apple.) It has grown 
because the hunger for applications that simplify or enhance work, and for the functional 
infrastructure needed to support them, is insatiable. Doubtlessly, UNIX will continue to grow in size and 
popularity (although some of us would prefer it grow in a graceful and planned manner). However, there 
are times when the "essence" of UNIX must be examined and understood, such as when a native port is 
conducted. By restricting UNIX functions via conditional compilation, we can work on making the core of 
the kernel functional. Once the core is functional, remaining portions can be added incrementally. This 
incremental methodology allows us to backtrack when errors or malfunctions occur. In addition, we 
always have recourse to the previous version if necessary. 

Composing the Basic Minimal UNIX Kernel 

What constitutes a minimal UNIX kernel? This varies according to the kind of port desired and resources 
available. For example, one alternative plan we almost selected involved using the Network File System 
(NFS) instead of working with the hard disk. If we had chosen that approach, code for implementing an 
NFS client, along with the networking code, would have been a mandatory component of our minimal 
port, while the disk driver and related support would have been relegated to a less-important role. 

Since we are concentrating on the machine-dependent portions of the minimal kernel, we must pare down 
considerably what is required. For 386BSD, we opted for a traditional port (see DDJ March 1991) that 
relied on a hard disk, a console interface (via the keyboard and display) and the process reschedule clock 
(via the interval timer). All network protocol and related system services (interprocess communications) 
were removed through conditional compilation. Any extended functionality in the main body of the kernel 
meant to accelerate operations (for example, macros, hash lookups, short circuit evaluation) was also 
avoided — after all, it makes no sense to improve the speed of something that does not even run to begin 
with. Also, algorithm improvement is not always a machine-independent phenomenon. 

The point in generating the "tiniest" kernel imaginable is to simplify the port. At this stage, we never 
expected to run something this small as a complete "production" system. As we incrementally added 
subsystems to our minimal kernel, we got a clearer understanding of the impact of each on the kernel. 

Even within this small system, redundant code and interfaces occurred. As such, a small amount of 
patience in this area always pays a handsome dividend later. 

Our minimal kernel was created by adding conditional compilation statements (#ifdefs) to the BSD kernel 
source code to disable the subsystems for networking, TCP/IP protocols, routing, NFS, interprocess 
communications (other than pipes), user process debugging, and the related services on which they depend. 
In addition, since we only needed drivers for disk, display, keyboard, and process scheduling clock, we 
could scale down the drivers and omit autoconfiguration code. Once this was operational, fleshing out 
the drivers and adding back support for running debuggers caused the kernel to grow considerably. 
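
The paring-down just described can be pictured with a small C sketch. It is an illustration only: the option macros (OPT_NETWORKING, OPT_NFS) and the subsystem table are hypothetical names invented for this example, not the options or structures used in the BSD sources. With no options defined, only the disk, console, and clock support are compiled in.

```c
#include <stddef.h>

/* Hypothetical option macros (not the real BSD option names). A kernel
 * configuration would define them at build time; leaving them undefined,
 * as here, models the minimal kernel. */
/* #define OPT_NETWORKING */
/* #define OPT_NFS */

struct subsystem {
    const char *name;
    void (*init)(void);
};

static int ninit;   /* number of subsystems initialized so far */

static void disk_init(void)    { ninit++; }
static void console_init(void) { ninit++; }
static void clock_init(void)   { ninit++; }
#ifdef OPT_NETWORKING
static void net_init(void)     { ninit++; }
#endif
#ifdef OPT_NFS
static void nfs_init(void)     { ninit++; }
#endif

/* Only the entries that survive conditional compilation exist at all. */
static const struct subsystem subsystems[] = {
    { "disk",    disk_init },
    { "console", console_init },
    { "clock",   clock_init },
#ifdef OPT_NETWORKING
    { "net",     net_init },
#endif
#ifdef OPT_NFS
    { "nfs",     nfs_init },
#endif
};

/* Initialize every compiled-in subsystem; returns how many were started. */
int init_subsystems(void)
{
    size_t i;

    ninit = 0;
    for (i = 0; i < sizeof subsystems / sizeof subsystems[0]; i++)
        subsystems[i].init();
    return ninit;
}
```

Adding a subsystem back is then a matter of defining its option and rebuilding, which is what makes the incremental approach easy to back out of when something breaks.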


With all the concern these days over "bloated" kernels, with the consequent support, extensibility, and 
other problems, it is instructive to examine a sample listing of what can be considered a "stripped-down" 
150-Kbyte kernel; see Figure 1(a), page 85. (By the way, by abiding by the rules outlined earlier and by 
using only the drivers necessary for basic functionality, our initial 386BSD kernel was less than 100 
Kbytes in size, and was both debuggable and extensible.) As an example of how this differs from a 
production system, Figure 1(b), page 85, contains the same breakdown for a more recent system (using a 
derived MACH virtual memory system, NFS, TCP/IP, multiple disk and Ethernet controllers, and other 
features added). 

How Can You Be in Two Places at Once...? 

By design (DDJ January 1991), we want our operating system kernel to run at the top of the virtual 
address space (currently, location 0xfe000000) as in Figure 2. However, our PC memory is mapped into 
the lower portion of the address space before memory management is turned on. Thus, our bootstrap 
program must load the kernel program into low physical memory to run, even though the kernel has all of 
its absolute addresses directed to the top, where no physical memory is present! 

For the short run, the kernel executes code which manually compensates for this problem, especially in the 
case of the data operands. Code operands are stored as relative offsets that work regardless of location 
(so-called "PIC" or Position Independent Code). PIC coding can be quite cumbersome (see Listing One, 
page 85, from "start" to "begin"). Fortunately, the actual amount of code required to operate in this fashion 
is small — just enough to enable our memory relocation hardware (the MMU). 

As we recall (see DDJ January 1991), the 386 MMU utilizes a "two-level" paging scheme in order to 
determine the physical page frame number, that is, the actual address of physical memory underneath the virtual 
address. This mechanism works by splitting the incoming virtual address into three parts: 10 bits of page 
table directory index, 10 bits of page table index, and 12 bits of offset within a page. The page table 
directory is a single page of physical memory that facilitates allocation of page table space by breaking it 
up into 4-Mbyte chunks of linear address space, one for each of its 1024 PDEs (Page Directory Entries), which 
determine the location of underlying page tables in physical memory. Each PDE-addressed page of a page 
table contains 1024 PTEs (Page Table Entries). A PTE is similar in form and function to a PDE. The 
major difference between a PDE and a PTE is one of hierarchy: A PDE selects the physical page frame of 
PTEs while a PTE selects the physical page frame for the desired reference. Once the page frame is combined 
with the 12 least-significant (offset) bits of the virtual address, the final address is determined. This two-level mechanism is 
quite elaborate, but it elegantly allows for the sparse allocation of address space, so that the whole address 
space or even all of the address space mapping information need not be present. In contrast, a one-level 
mapping scheme would require 4 Mbytes of real memory per task for mapping alone (2^20 entries of 4 bytes each), too much even for 
many modern systems. 
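
The 10/10/12 split and the one-level cost quoted above can be checked with a few lines of C. This is a minimal sketch; the macro and function names are ours, chosen for illustration, and are not taken from the 386BSD sources.

```c
#include <stdint.h>

#define PAGE_SHIFT   12        /* 12 bits of offset: 4-Kbyte pages      */
#define PDE_SHIFT    22        /* the top 10 bits select the PDE        */
#define INDEX_MASK   0x3ffu    /* 10-bit index: 1024 entries per table  */
#define OFFSET_MASK  0xfffu

/* Index into the page directory (which 4-Mbyte chunk). */
uint32_t pde_index(uint32_t va)   { return va >> PDE_SHIFT; }

/* Index into the page table selected by the PDE. */
uint32_t pte_index(uint32_t va)   { return (va >> PAGE_SHIFT) & INDEX_MASK; }

/* Byte offset within the final 4-Kbyte page. */
uint32_t page_offset(uint32_t va) { return va & OFFSET_MASK; }

/* Bytes of mapping data a flat, one-level table would need: one 4-byte
 * entry for each of the 2^20 pages in a 4-Gbyte address space. */
uint32_t one_level_table_bytes(void)
{
    return (1u << (32 - PAGE_SHIFT)) * (uint32_t)sizeof(uint32_t);
}
```

For the kernel base of 0xfe000000, pde_index() gives 1016, so the kernel begins in the 1017th of the 1024 4-Mbyte chunks, and one_level_table_bytes() gives 4,194,304, the 4 Mbytes mentioned in the text.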

To run our kernel program with the MMU enabled, we must build page tables that describe the physical 
location of memory storing the program, as well as the mode of access allowed to each "page" (otherwise 
known as the "allocation granularity" of the MMU) as in Figure 3. In addition, the MMU must have a 
page directory table describing where it can find all of the possible 1024 page table pages which allow it to 
access any part of its 4-gigabyte address space. In a way, the 386 MMU acts almost as a "coprocessor" to 
the 386 CPU, interpreting two data structures (page directories and page tables) on behalf of the CPU and 
translating virtual addresses into physical ones. (The MIPS RISC MMU is actually referred to as a 
coprocessor.) 
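
As a rough illustration of what "building page tables" involves, the following C sketch fills one page table with entries for a physically contiguous region. The bit positions (present in bit 0, writable in bit 1, user in bit 2, frame address in the upper 20 bits) follow the 386 architecture; the macro and function names are assumptions made for this example and need not match those in Listing One.

```c
#include <stdint.h>

#define PAGE_SIZE  4096u
#define NPTE       1024u          /* entries in one page table      */
#define PG_V       0x001u         /* present (valid)                */
#define PG_RW      0x002u         /* writable                       */
#define PG_U       0x004u         /* accessible from user mode      */
#define PG_FRAME   0xfffff000u    /* physical page frame address    */

/* Compose one entry: frame address plus access mode, marked valid. */
uint32_t make_pte(uint32_t phys_addr, uint32_t prot)
{
    return (phys_addr & PG_FRAME) | prot | PG_V;
}

/* Fill one page table so that its first npages entries map a physically
 * contiguous region starting at phys_base; the remaining entries are
 * left invalid (zero), so touching them would fault. */
void fill_page_table(uint32_t pt[NPTE], uint32_t phys_base,
                     uint32_t npages, uint32_t prot)
{
    uint32_t i;

    for (i = 0; i < NPTE; i++)
        pt[i] = (i < npages) ? make_pte(phys_base + i * PAGE_SIZE, prot) : 0;
}
```

A page directory entry has the same layout, except that its frame address names the page holding a page table rather than the page being referenced.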


While our code dutifully builds our page tables and page directory table to make the above mapping work 
(see Listing One, near comment "build page tables"), we are still left with a dilemma: How do we turn on 
mapping and run at a "high" virtual address when we are still running at a "low" address? In other 
words, how do we make the CPU switch from one address to another? Well, the answer could depend on 
an understanding of many hardware-related issues (such as the size of instruction prefetch queue, 
instruction pipelining, address translation overlap, multiprocessor arbitration, and so forth), while avoiding 
irregularities or non-standard approaches. For example, some systems programmers have gotten away 
with murder over the years by assuming in the software that the processor already has the instruction after 
the MMU is enabled (not always true, mind you), instead of verifying it as they should in all cases. This 
situation is analogous to people who dive over three lanes of traffic at the last second just to make a 
freeway exit. Most of the time it works, but occasionally it doesn’t. In this case, Superposition of Matter 
(unlike radio waves) doesn’t hold (although Total Conservation of Mass does hold). A disaster, possibly a 
crash, occurs — so it goes with systems as well. 

Not that this area will get any easier, either, what with the even more esoteric versions of the "N"86 on the 
drawing board. (By the way, has anyone trademarked "N"86 yet?) One must anticipate where the 
technology will be taken. For example, one might need to assume that the instruction queue always 
consists of at least one instruction. As technology shifts, features which are relied upon by even the most 
careful of programmers can be abandoned for better ones (for example, a fully pipelined instruction 
execution with pipelined MMUs that update address space state for branch prediction use). 

In this case, the appropriate path around all of these hazards is simply to map the bottom of address space 
to the same location — or "double-map" the same program. This way, it will work regardless of what the 
hardware designers do. We could also have replicated the page tables to map the bottom of address space 
where the kernel program begins, but we would end up duplicating the same page tables used at the top of 
the address space, and that would be very wasteful. Instead, we just double-map by pointing the bottom page directory 
entry (the one that maps the bottom 4 Mbytes of address space) at the same page table as the page directory entry that maps the 
kernel. Once this is accomplished, the MMU can be enabled. 
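
The effect of the double-map can be demonstrated with a small simulation in C. This is a sketch under stated assumptions, not the code of Listing One: "physical memory" is a two-page array (page 0 holds the page directory, page 1 the single page table), the kernel is assumed to be loaded at physical address 0, and all names are ours. The translate() routine walks the two levels in software the way the MMU does in hardware.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define NENTRIES  1024u
#define PG_V      0x001u
#define PG_RW     0x002u
#define PG_FRAME  0xfffff000u
#define KERNBASE  0xfe000000u   /* kernel virtual base, from the text */

/* Simulated physical memory: page 0 = page directory, page 1 = page table. */
static uint32_t sim_phys[2 * NENTRIES];

/* Map kernel_pages pages of the kernel (assumed loaded at physical 0)
 * at KERNBASE, then double-map the bottom 4 Mbytes onto the same table. */
void build_boot_tables(uint32_t kernel_pages)
{
    uint32_t *pd = &sim_phys[0];
    uint32_t *pt = &sim_phys[NENTRIES];
    uint32_t i;

    memset(sim_phys, 0, sizeof sim_phys);
    for (i = 0; i < kernel_pages && i < NENTRIES; i++)
        pt[i] = (i * PAGE_SIZE) | PG_V | PG_RW;

    pd[KERNBASE >> 22] = (1u * PAGE_SIZE) | PG_V | PG_RW; /* table is page 1 */
    pd[0] = pd[KERNBASE >> 22];     /* the double-map: same page table */
}

/* Software walk of the two-level tables; returns 0 and stores the
 * physical address in *pa, or returns -1 if the address is unmapped. */
int translate(uint32_t va, uint32_t *pa)
{
    uint32_t pde = sim_phys[va >> 22];
    const uint32_t *pt;
    uint32_t pte;

    if (!(pde & PG_V))
        return -1;
    pt = &sim_phys[(pde & PG_FRAME) / sizeof(uint32_t)];
    pte = pt[(va >> 12) & 0x3ffu];
    if (!(pte & PG_V))
        return -1;
    *pa = (pte & PG_FRAME) | (va & 0xfffu);
    return 0;
}
```

After build_boot_tables(), a low address such as 0x00001234 and its high twin 0xfe001234 translate to the same physical location, so the instruction stream is valid at either address at the moment paging is switched on. Only one page table exists; the cost of the double-map is a single extra directory entry.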

Now that we are "running virtual," we need to leave the "bottom" of address space by jmping from low to 
high (see DDJ March 1991). However, we must avoid a PIC (relative) form of the jmp instruction, or else "jmp high" 
will keep us low. Because "clever" assemblers and loaders transparently assume that PIC is desired, a 
quick solution is to push a constant on the stack and execute a return (ret) instruction. 
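
Why a relative jmp "keeps us low" comes down to simple arithmetic, modeled here in C (the real sequence is of course a couple of assembler instructions; the function names are ours). A relative jump encodes only a displacement computed from link-time addresses, and the processor adds that displacement to wherever it is actually running.

```c
#include <stdint.h>

/* Destination of a relative jump. The displacement was fixed at link
 * time from link-time addresses, but at run time it is added to the
 * address the processor is really executing from. */
uint32_t relative_jump_dest(uint32_t run_next_ip,
                            uint32_t link_next_ip,
                            uint32_t link_target)
{
    uint32_t disp = link_target - link_next_ip;  /* encoded in the opcode */
    return run_next_ip + disp;
}

/* Destination of "push constant; ret": the absolute link-time address,
 * independent of where the code is running. */
uint32_t push_ret_dest(uint32_t link_target)
{
    return link_target;
}
```

With the kernel linked at 0xfe000000 but executing from low memory, a relative jump from 0x105 to a label linked at 0xfe000200 lands at 0x200, still low. The push/ret pair lands at 0xfe000200, which is where we want to be.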

UNIX as a Subroutine Call 

Once you are running "high," you need to install a stack. The stack should be placed in the process’s 
portion of address space. This way, it can be easily changed when we move from one process or task to 
another, because each process must have its own kernel stack. In a way, 386BSD functions like a 
subroutine call for a user process, with its own internal calls stacking on this separate stack, unlike the 
"jump to system" program approach used on systems such as TOPS-10. 

Keeping each process’s kernel stack at the same virtual location works well for single-threaded 
processes, but is not advisable for multithreaded execution. For the multithreaded version of 
386BSD, "lightweight" processes will require multiple kernel stacks and will be allocated out of kernel 
global virtual memory as needed. 


Configuring the 386 for UNIX Operation 


With the kernel program’s address space established, we must "wire" the 386 processor hardware to the kernel 
interfaces and set initial conditions for the system, including interrupt and exception processing, user 
process address space definition, and preparation for context switching. All of these facilities must be set 
up from the earliest point possible, because before we leave the kernel to execute a single user process 
instruction, we are already running multitasking. In fact, we will even use multitasking and exception 
processing as we initialize the system! This really should come as no surprise, as software aficionados can 
never resist the temptation to use double-duty or recursive code — or even inscrutable self-referential code. 
As a result, we page-fault the page tables to allocate them to be used in paging the first process. 


Segments Revisited 


There is an old expression that says: "If one is used to using a hammer, everything else becomes a nail." 
The 386 hammer of choice, segments, must be used to pound together the rest of the architecture no matter 
what. In other words, even if segments are not desired, they must be allocated and initialized. Because we 
have chosen to achieve most of our functionality via the paging mechanism (see DDJ January 1991), we 
try to minimize the need for the segmentation mechanism, but allow for future extensibility in such areas 
as dynamically growable tables (for example, ldt, gdt, ...). Currently, init386() relies only on a constant 
table (see Listing Two, page 86). In addition, separate descriptors for data and code segments are required 
as they use different attribute sets, even though they exactly alias each other. This allows for some 
interesting effects, such as allowing code to be executed out of the stack! 
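
To show how a code and a data descriptor can alias the same memory while carrying different attributes, here is a sketch in C that packs the scattered fields of a 386 segment descriptor into its 8-byte form. The field layout follows the 386 architecture; the function and macro names are illustrative assumptions and are not the ones used in Listing Two.

```c
#include <stdint.h>

/* Access-byte values for flat ring-0 segments: present, DPL 0,
 * code-or-data segment, then the type bits. */
#define ACC_KCODE  0x9au   /* execute/read code segment  */
#define ACC_KDATA  0x92u   /* read/write data segment    */
#define FLG_4K_32  0xcu    /* 4-Kbyte granularity, 32-bit default size */

/* Pack base, 20-bit limit, access byte, and flag nibble into the
 * descriptor format the processor expects. */
uint64_t pack_descriptor(uint32_t base, uint32_t limit,
                         uint32_t access, uint32_t flags)
{
    uint64_t d = 0;

    d |= (uint64_t)(limit & 0xffffu);             /* limit 15:0   */
    d |= (uint64_t)(base & 0xffffffu) << 16;      /* base 23:0    */
    d |= (uint64_t)(access & 0xffu) << 40;        /* access byte  */
    d |= (uint64_t)((limit >> 16) & 0xfu) << 48;  /* limit 19:16  */
    d |= (uint64_t)(flags & 0xfu) << 52;          /* G, D, 0, AVL */
    d |= (uint64_t)(base >> 24) << 56;            /* base 31:24   */
    return d;
}
```

Packing a base of 0 and a limit of 0xfffff with ACC_KCODE and then with ACC_KDATA yields two descriptors that cover the identical 4-Gbyte range and differ in a single type bit, which is exactly the aliasing described above.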


The approach outlined thus far has permitted coding to proceed in C. This is actually quite important, as 
the descriptor code and bitfields are obscure enough without any additional complications, such as additional 
coding in assembler. We actually could have worked in the reverse manner (invoking segments and then 
paging) by using the descriptors to relocate user space and run the kernel "low," but this would have 
significantly increased bookkeeping overhead when going between user and kernel without offering any 
clear advantage. Also, the construction of segment descriptors in assembly language can be quite tedious if 
done in this manner. 


Interrupts and Exceptions 

In the standalone system (see DDJ March 1991) we built a Global Descriptor Table to reinitialize 
segmentation. Now, we must follow the same techniques developed for the standalone system to build a 
Global Descriptor Table for descriptors used primarily by the kernel, and a Local Descriptor Table used 
primarily by the user tasks. (The local table can later be made "relative" per task if desired.) An Interrupt 
Descriptor Table must also be built that instructs the processor to execute special assembly code stubs 
(located at IDTVEC(XXX) entry points) within the kernel (SEL_KPL) when any exceptions are
triggered. Low-level code in each bus adapter’s support code, used to wire-down all possible interrupts, is 
called to catch unintended interrupts prior to the configuring of devices by the kernel. And finally, through 
the use of special assembly language entry points (see DDJ March 1991), the descriptor tables are loaded, 
with any user or kernel exceptions caught on the fly. 
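
As a rough illustration of what each Interrupt Descriptor Table entry holds, the sketch below builds a selector and a gate in plain integer arithmetic. The helper names (make_selector, make_gate) are hypothetical; the listing's own setidt() fills bitfields instead. The layout is the standard 386 gate format: the handler offset is split into low and high halves around the selector and attribute bits.

```c
#include <assert.h>
#include <stdint.h>

enum { KPL = 0, UPL = 3 };   /* kernel and user privilege levels */

/* Selector = table index, table indicator (0 = GDT, 1 = LDT), requested level. */
uint16_t make_selector(unsigned index, unsigned use_ldt, unsigned rpl)
{
    assert(index < 8192 && rpl <= 3);
    return (uint16_t)((index << 3) | ((use_ldt & 1) << 2) | rpl);
}

/* Gate = handler offset split in two, code selector, type, DPL, present bit. */
uint64_t make_gate(uint32_t handler, uint16_t selector, unsigned type, unsigned dpl)
{
    uint64_t g = 0;

    assert(type <= 0x1F && dpl <= 3);
    g |= (uint64_t)(handler & 0xFFFF);       /* offset 15:0            */
    g |= (uint64_t)selector << 16;           /* where the handler runs */
    g |= (uint64_t)type << 40;               /* e.g. 0x0F = trap gate  */
    g |= (uint64_t)dpl << 45;                /* who may invoke it      */
    g |= (uint64_t)1 << 47;                  /* present                */
    g |= (uint64_t)(handler >> 16) << 48;    /* offset 31:16           */
    return g;
}
```

Setting the gate's DPL to the user level, as the listing does for the breakpoint vector, is what lets user code raise that exception deliberately; the remaining vectors stay at kernel level.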

Up to this point, we’ve just assumed that sufficient memory would be present to satisfy our needs, but this
should not continue. Instead, we must probe and check the amount of memory present against recorded
values in the system’s configuration memory. If a value appears unusually large, we choose the lesser of
the two. If both seem questionable, however, we revert to our minimum assumption: 640 Kbytes of base
memory only.
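
The policy just described can be sketched as a small function. This is a simplified reading of the prose, not the code in Listing Two: the names (resolve_memory, plausible) and the sane_max parameter are assumptions introduced here to stand for whatever test decides that a value is "questionable."

```c
#include <assert.h>
#include <stdint.h>

#define BASE_MEM_BYTES (640u * 1024u)   /* minimum assumption: base memory only */

/* Assumed plausibility test: at least base memory, at most a caller-chosen cap. */
static int plausible(uint32_t bytes, uint32_t sane_max)
{
    return bytes >= BASE_MEM_BYTES && bytes <= sane_max;
}

/* probed: size found by probing; recorded: size from configuration memory. */
uint32_t resolve_memory(uint32_t probed, uint32_t recorded, uint32_t sane_max)
{
    int p_ok, r_ok;

    assert(sane_max >= BASE_MEM_BYTES);
    p_ok = plausible(probed, sane_max);
    r_ok = plausible(recorded, sane_max);

    if (!p_ok && !r_ok)
        return BASE_MEM_BYTES;                      /* trust neither */
    if (!p_ok)
        return recorded;
    if (!r_ok)
        return probed;
    return probed < recorded ? probed : recorded;   /* lesser of the two */
}
```

Erring on the low side costs some usable memory on a misreported machine, but it avoids touching addresses that are not backed by RAM.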

Memory in hand, we next initialize the virtual memory system that will manage both physical memory and 
virtual address space. The routine pmap_bootstrap() scales resources and assumptions based on available
physical memory, and synchronizes the arrangement of the early "pmap" or physical map of the system to 
its internal data structures. The Mach virtual memory system, portions of which are incorporated into 
BSD, is split into machine dependent (pmap) and machine independent (vm_map) parts. 

The remaining portion of init386() creates a way for a user process to enter the kernel and an initial 
process state through which a user process can be run for the first time. Because processes inherit these 
characteristics, this "zeroth" process state in effect initializes all subsequent sibling processes! 


Upon executing init386() and main() (which initializes the kernel), the system is prepared for running the 
user process. Listing One (near the end) contains code which moves us into user space to execute the very 
first process. Little work is done to the user process itself — instead, the exception mechanisms are relied 
upon to supply memory and instructions to it. This occurs from the point of initialization, because the init 
process that starts the system itself is faulted in incrementally. 


Summary: What Do We Have Now? 

As you may have noted, over the course of this series we have been building upon our previous work as 
we head toward our goal, and increasingly we are relying on an understanding of our growing set of tools. 
And, at the same time, we have recently changed some of our code to accommodate some of the exciting 
new developments at Berkeley. With all the changes occurring, even those very familiar with this software 
can become somewhat "lost." 


At this stage, it is important to go back and recall the perspective we tried to establish on this project.
We compared it to that of climbing a mountain, and we carefully outlined and prepared for all the 
problems we thought we would encounter. However, even with all the preparation we could muster, we’ve 
still had to be fast on our feet. Paths which we had carefully mapped out just six months ago are wiped 
away by an avalanche — removed forever by the force of innovation. Work and time and effort have been 
tossed aside as we’ve been forced to adopt new approaches, not only to keep up with the group, but
occasionally to set the pace (as in recursive paging). And finally, as our system grows, the complexity
grows as well, and with it the blizzards of bugs and incompatibilities that occasionally blind and dispirit.

And now, after months of effort, we have developed the barest of kernels. We will continue on with our 
kernel development, but we now have the makings of the "Basic Kernel." Key elements of our Basic 
Kernel (multitasking, processes, device drivers, executing the first process, games tests, paging, and 
swapping) are crucial to establish a working understanding of 386BSD and Berkeley UNIX. We look 
forward to seeing you on the trail with us. 

Brief Notes: 386BSD Recursive Paging 

When we began this project, many of our notions were based on prior experience in that we emphasized 
the similarities of the 386 to other machines while discounting its idiosyncrasies. Like a new car owner 
fumbling in the wrong place for the headlight switch and cursing it for having moved from the dashboard 
to the steering column, we mainly tried to just get 386BSD running. However, once we felt "settled in,"
we decided to see if we could take it to the limit. Consequently, the last few months have been like
motoring with Mr. Toad: with the onslaught of software changes, "wild" seems too weak a word.

In keeping with the CSRG goals for the upcoming 4.4BSD release, one major task was to migrate 386BSD 
to a virtual memory system derived from CMU’s MACH operating system. While this decision was 
appropriate, a major problem relating to the 386 arose almost immediately; when implemented as designed 
by CMU on the 386, the virtual memory system swallowed copious quantities of virtual address space for 
the operating system — space which is needed for user processes. Most of this space was gobbled up 
maintaining address maps of all in-memory processes' page tables, so that the system could maintain
access to them at all times should they become active. 

At the same time, we had been getting somewhat tired of how process page table mapping was handled in 
the traditional BSD virtual memory system; since the page tables themselves were in physical memory 
(for use by the 386’s MMU), we needed pages of page tables to map the page tables themselves before we 
could modify them. As you can guess, this increased the amount of "bookkeeping" overhead considerably, 
especially when interacting features are added (such as shared libraries and shared memory). We hoped 
there would be a better way. 

An ideal virtual memory system design gives access to information on the virtual-to-physical translation 
process (and the converse) very quickly. However, while the information is there, right on the same piece 
of silicon and working at warp speed doing just this, there is no way via software to invoke the mechanism 
other than through transparent processes — creating the "virtual memory" effect. (Don’t expect any change 
in this area any time soon, either, because for many hardware design reasons this is a nontrivial addition.) 
As a consequence, the systems programmer must encode a tedious subroutine with a sole purpose to 
emulate the same translation process in software that is performed in a fraction of the time of a single 
instruction by part of the hardware. 
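
For readers who have not had to write one, the "tedious subroutine" looks roughly like the sketch below: a software walk of a two-level 386 page table. Everything here is illustrative. Physical memory is simulated by a small array, and the names (sim_mem, soft_translate, sim_setup, NO_MAPPING) are hypothetical; a real kernel would first need some way of addressing the tables, which is exactly the difficulty discussed above.

```c
#include <assert.h>
#include <stdint.h>

#define PG_V        0x001u        /* entry valid */
#define PG_FRAME    0xFFFFF000u   /* frame address bits of an entry */
#define NO_MAPPING  0xFFFFFFFFu   /* sketch-only "would fault" marker */

#define SIM_PAGES   3u
static uint32_t sim_mem[SIM_PAGES * 1024u];   /* simulated physical memory */

/* Read one 32-bit word of simulated physical memory. */
static uint32_t read_phys(uint32_t pa)
{
    assert(pa / 4u < SIM_PAGES * 1024u);
    return sim_mem[pa / 4u];
}

/* Emulate the MMU: directory lookup, then table lookup, then add the offset. */
uint32_t soft_translate(uint32_t cr3, uint32_t va)
{
    uint32_t pde, pte;

    pde = read_phys((cr3 & PG_FRAME) + ((va >> 22) << 2));
    if (!(pde & PG_V))
        return NO_MAPPING;
    pte = read_phys((pde & PG_FRAME) + (((va >> 12) & 0x3FFu) << 2));
    if (!(pte & PG_V))
        return NO_MAPPING;
    return (pte & PG_FRAME) | (va & 0xFFFu);
}

/* Tiny map: directory in frame 0, one page table in frame 1,
 * one data page in frame 2 mapped at virtual address 0x00402000. */
void sim_setup(void)
{
    sim_mem[1] = 0x1000u | PG_V;            /* directory slot 1 -> page table */
    sim_mem[1024u + 2u] = 0x2000u | PG_V;   /* table slot 2 -> data page      */
}
```

After sim_setup(), translating 0x00402ABC with the directory at physical address 0 yields 0x2ABC, while any address outside the single mapped page returns the fault marker.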

On the 386, page tables and page directories appear very similar — in fact, they're identical in contents 
(see top of Listing Four, page 90). Turning the usual paging paradigms upside down, we examined what
would occur if the page tables and page directories were viewed as if they were software data structures 
that could be connected in different ways. For example, frequently we want to find the page table entry 
associated with a given page. Obviously, the MMU does just this as it processes an ordinary reference to a 
page and continues on to "indirect" through the PTE to get to a page. Upon reflection, we noticed that if 
we arranged it so that the MMU goes through the same entry twice, we could get it to "use up" an 
indirection. This would allow us to reference the PTE itself instead of the underlying page. This approach, 
while unorthodox and confusing to the uninitiated, turned out to be quite feasible. 

Thus was born the "recursive" page map technique — one guaranteed to annoy the zealot and amaze the 
skeptic. Based on the "self-referential" model, the 386BSD recursive page table mechanism undergoes two 
iterations in the process of obtaining the PTE itself. In the first iteration (see Figure 4(a)), a reference is
made to the PTE of the page table directory. In the second iteration (see Figure 4(b)), a reference is made
to the PDE that maps the page directory itself. In other words, by "pointing" a page directory entry at the
page directory itself, we have created a window in our virtual address map that consecutively maps all of
the address space’s page tables (in corresponding order as well) without the need for another page of
memory. In addition, this technique also maps the page directory itself, as a consequence of the second 
indirection, through the "recursive" page directory element. 





To return to the previously mentioned example, we can find a PTE for a page with the macro vtopte() as 
seen in Listing Four, which consists of just a shift and an add. Additional macros here demonstrate the
simplicity this method gives the virtual memory system. 
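
The arithmetic behind the recursive window can be worked through with plain integers. The sketch below is not the macro set from Listing Four; the function names are hypothetical, and only the window base 0xFDC00000 is taken from the constants in Listing Three (UPT_MIN_ADDRESS). It shows that the page directory itself appears inside the window at the address Listing Three calls UPT_MAX_ADDRESS.

```c
#include <assert.h>
#include <stdint.h>

#define PGSHIFT     12            /* log2 of the page size             */
#define PDRSHIFT    22            /* log2 of the span of one directory entry */
#define PTMAP_BASE  0xFDC00000u   /* start of the page table window    */

/* Index of the directory entry that points back at the directory. */
uint32_t recursive_slot(void)
{
    /* The window must begin on a 4-Mbyte boundary. */
    assert((PTMAP_BASE & ((1u << PDRSHIFT) - 1u)) == 0);
    return PTMAP_BASE >> PDRSHIFT;
}

/* Virtual address at which the PTE that maps 'va' can be read or written. */
uint32_t pte_va(uint32_t va)
{
    return PTMAP_BASE + ((va >> PGSHIFT) << 2);
}

/* The directory is simply "the page table that maps the window." */
uint32_t ptd_va(void)
{
    return pte_va(PTMAP_BASE);
}

/* Virtual address of the directory entry covering 'va'. */
uint32_t pde_va(uint32_t va)
{
    return ptd_va() + ((va >> PDRSHIFT) << 2);
}
```

Here pte_va(0x1000) is 0xFDC00004, the second entry of the first page table, and ptd_va() is 0xFDFF7000. No table of table addresses is consulted anywhere, which is where the bookkeeping savings come from.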

The benefits of this technique are compelling: 

• We were able to reuse an existing data structure — the page directories and page tables (contrary to 
the intentions of the hardware designers, by the way) — thus reducing the memory cost of a process. 

• We were able to reduce the number of items we need to track per process, thus reducing bookkeeping 
overhead. 

• This method allowed us to conveniently mediate the cost of process page tables. (The process page 
tables belong to and don’t clutter up the operating system kernel space.) 

• We were able to increase the locality of reference, such that the processor cache performance is 
enhanced. 

• We were able to provide a more convenient model of memory for the operating system to exploit. 

Particularly relevant to items 2 and 5, by writing the 386 machine-dependent support routines in a 
recursive manner, we were able to make the code perform double-duty in a module a fraction of the size of 
previous 386 versions. In addition, the multiprocessor version of 386BSD may derive some benefit from 
this technique when used to hierarchically share page directory regions. It is rare when you find a method 
that conceptually fits so well and as a side-effect improves performance. 

This technique is not limited to the 386 by any means; other two-level paging MMU microprocessors 
(68030, Clipper, 32532, 88000, ...) theoretically can leverage this technique, though probably with less 
benefit. Because most of these processors have separate address spaces for kernel and user, waste in the 
kernel does not rob memory from the user process as it does on the 386. 

— B.J. and L.J. 

Figure 1(a): Minimal Kernel Breakdown (by module)

vmunix:
    text    data     bss  module name
    1152      32       0  clock.o
       0     500       0  conf.o
    4548     740      32  cons.o
    1508      24       0  init_main.o
       0    1212       0  init_sysent.o
    1588      28       0  kern_clock.o
    2044      12       0  kern_descrip.o
    3296      80       0  kern_exec.o
    1840      48       0  kern_exit.o
    1600      36       0  kern_fork.o
     956       0       0  kern_mman.o
     312       0       0  kern_proc.o
    1280       0       0  kern_prot.o
    1216       0       0  kern_resource.o
    3564      32       0  kern_sig.o
     684      16       0  kern_subr.o
    1808      24       0  kern_synch.o
    1864       4       0  kern_time.o
     248       0       0  kern_xxx.o
    6176   20508       0  locore.o
    5596     596       0  machdep.o
       0     148       0  param.o
    2184      84       8  subr_prf.o
    1092      72       0  subr_rmap.o
     244       0       0  subr_xxx.o
     184      72       0  swapgeneric.o
    3340       0       0  sys_generic.o
    4156      68       0  sys_inode.o
    1096      56       0  sys_process.o
     784      16       0  sys_socket.o
    2260     224       0  trap.o
    9480     516       0  tty.o
      12     204       0  tty_conf.o
    3928       4       0  tty_pty.o
    1924       0       0  tty_subr.o
    8680    1220       0  ufs_alloc.o
    3312     116       0  ufs_bio.o
    1668       0       0  ufs_bmap.o
    1248      48       0  ufs_disksubr.o
     416       0       0  ufs_fio.o
    3968      68       0  ufs_inode.o
     436       0       0  ufs_machdep.o
    2048       0       0  ufs_mount.o
    6020     220       0  ufs_namei.o
    2288     208       0  ufs_subr.o
    7100     112       0  ufs_syscalls.o
       0     620       0  ufs_tables.o
       0     152       0  vers.o
    2280      48       0  vm_drum.o
    2964      52       0  vm_machdep.o
    4364     180       0  vm_mem.o
    8280     188       0  vm_page.o
    2056      20       0  vm_proc.o
    3060      24       0  vm_pt.o
    2788      72       0  vm_sched.o
     528       0       0  vm_subr.o
    1052      32       0  vm_sw.o
    1836      44       0  vm_swap.o
    1536     152       0  vm_swp.o
    2048      68       0  vm_text.o
    3768    1492    1024  wd.o

totals: 145708 30492 1064


Figure 1(b): Fully Loaded Kernel Breakdown (by module)

vmunix:
    text    data     bss  module
       0       4       0  af.o
     592      16       0  autoconf.o
     844       0       0  clock.o
    2584     168       0  com.o
       0     640       0  conf.o
    4096     676      40  cons.o
     540     132       0  dead_vnops.o
    1440      28       0  device_pager.o
    3180     152      48  fd.o
    1264     140       0  fifo_vnops.o
    2812      12       0  if.o
    2600      12       0  if_ether.o
    1056      24      18  if_ethersubr.o
     464       0       0  if_loop.o
    5044      12      12  if_ne.o
    3184      16       0  if_sl.o
    3852      12       4  if_we.o
    2844       4       0  in.o
     356       0       0  in_cksum.o
    1684       0       0  in_pcb.o
      12     320       0  in_proto.o
    1496      12       0  init_main.o
       0    1532       0  init_sysent.o
       0     468       0  ioconf.o
    2056      68       0  ip_icmp.o
    4564      60      48  ip_input.o
    2616       0       0  ip_output.o
    1372       4       0  isa.o
    1204      16       0  kern_acct.o
    1280       4       0  kern_clock.o
    3184       0       0  kern_descrip.o
    3176       0       0  kern_exec.o
    1424       0       0  kern_exit.o
     996       8       4  kern_fork.o
    1204       0       0  kern_kinfo.o
    1772       0       0  kern_ktrace.o
    1028       4       0  kern_lock.o
    1892     268       0  kern_malloc.o
     796       0       0  kern_physio.o
    1180       0       0  kern_proc.o
    1844       0       0  kern_prot.o
    1140       0       0  kern_resource.o
    4172     132       0  kern_sig.o
     684       0       0  kern_subr.o
    1988       4       0  kern_synch.o
    1408       4       0  kern_time.o
     264       0       0  kern_xxx.o
    7076     684       0  locore.o
    4684     192       0  machdep.o
     552       0       0  mem.o
     708      44       4  mfs_vfsops.o
     656     132       0  mfs_vnops.o
    1600       0       0  nfs_bio.o
    1020       0       0  nfs_node.o
   21700      36       0  nfs_serv.o
    7748     152       0  nfs_socket.o
    1040     144   21672  nfs_srvcache.o
   10284      40       4  nfs_subs.o
    1956      72      80  nfs_syscalls.o
    2996      40       1  nfs_vfsops.o
   21304     424       0  nfs_vnops.o
     348      12      16  npx.o
       0     152       0  param.o
    6308      16       0  pmap.o
    2908       4      36  radix.o
     164       8       0  raw_cb.o
    1072      36       0  raw_ip.o
     812       0       0  raw_usrreq.o
    2304       8       0  route.o
    4552     116       0  rtsock.o
    2584      60       0  slcompress.o
    2296     180       0  spec_vnops.o
     716       0       0  subr_log.o
    1764       8       0  subr_prf.o
     888       0       0  subr_rmap.o
     340       0       0  subr_xxx.o
    5456      28       0  swap_pager.o
       0      40       0  swapvmunix.o
    3344       0       0  sys_generic.o
       0       0       0  sys_machdep.o
     904      56       0  sys_process.o
     604      20       0  sys_socket.o
     228       0       0  tcp_debug.o
    5820       8       0  tcp_input.o
    1896      16       0  tcp_output.o
    1504      12       0  tcp_subr.o
     832      60       0  tcp_timer.o
    1620       8       0  tcp_usrreq.o
    2912       0       0  trap.o
    9488     316       0  tty.o
    1864     204       0  tty_compat.o
      12     204       0  tty_conf.o
    3452       4       0  tty_pty.o
    1988       0       0  tty_subr.o
     504       0       0  tty_tty.o
    1980      36       0  udp_usrreq.o
    9644       0       0  ufs_alloc.o
    2012       0       0  ufs_bmap.o
    1424       0       0  ufs_disksubr.o
    3756       0       0  ufs_inode.o
    1668      12       0  ufs_lockf.o
    3832       4       0  ufs_lookup.o
    4572      20       0  ufs_quota.o
     732       0       0  ufs_subr.o
       0     620       0  ufs_tables.o
    3948      64       0  ufs_vfsops.o
    8264     524       0  ufs_vnops.o
     620       0       0  uipc_domain.o
    2672      64       4  uipc_mbuf.o
       8     176       0  uipc_proto.o
    6164       0       0  uipc_socket.o
    3184      24       0  uipc_socket2.o
    5520       0       0  uipc_syscalls.o
    3320      32       0  uipc_usrreq.o
       0     232       0  vers.o
    3644       0       0  vfs_bio.o
    1108       4       0  vfs_cache.o
       0      24       0  vfs_conf.o
    1776       0       0  vfs_lookup.o
    3940      44       0  vfs_subr.o
    7544       0       0  vfs_syscalls.o
    1684      20       0  vfs_vnops.o
    3524       0       0  vm_fault.o
    1964      20       0  vm_glue.o
      84       0       0  vm_init.o
    1848       0       0  vm_kern.o
     944       0     308  vm_machdep.o
    7624      16       0  vm_map.o
     384      20       0  vm_meter.o
    3196       4       0  vm_mmap.o
    3588      16       0  vm_object.o
    2500      32       0  vm_page.o
     824       8       0  vm_pageout.o
     636      20       0  vm_pager.o
    1160       0       0  vm_swap.o
     416       0       0  vm_unix.o
     304       0       0  vm_user.o
    2200      28       0  vnode_pager.o
    6176    1648     524  wd.o
    5252      48       9  wt.o

totals: 359636 12248 22832

[LISTING ONE] 

/* locore.s: Copyright (c) 1990,1991 William Jolitz. All rights reserved. 

* Written by William Jolitz 1/90 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

*/ 


/* [Excerpted from i386/locore.s] */ 

#define R(s) s - KERNEL_BASE /* relocate references until mapping enabled */ 

/* Per-process region virtual address space is located at the top of user 

* space, growing down to the top of the user stack [set in the "high" kernel]. 

* At kernel startup time, the only per-process data we need is a kernel stack, 

* so we allocate SPAGES of stack pages for the purpose before calling the 

* kernel initialization code. */ 

        .data
        .globl  _boothowto, _bootdev, _cyloffset

        /* Temporary stack */
        .space  128
tmpstk:
_boothowto:     .long   0       /* bootstrap options */
_bootdev:       .long   0       /* bootstrap device */
_cyloffset:     .long   0       /* cylinder offset of bootstrap partition */

        .text
        .globl  start
start:
        /* arrange for a warm boot from the BIOS at some point in the future */
        movw    $0x1234, 0x472
        jmp     1f
        .space  0x500           # skip over BIOS data areas
1:
        /* pass parameters on stack (howto, bootdev, cyloffset)
         * note: 0(%esp) is return address of bootstrap that loaded this kernel. */
        movl    4(%esp), %eax
        movl    %eax, R(_boothowto)
        movl    8(%esp), %eax
        movl    %eax, R(_bootdev)
        movl    12(%esp), %eax
        movl    %eax, R(_cyloffset)


/* use temporary stack till mapping enabled to insure it falls within map */ 
movl $R(tmpstk), %esp 

/* find end of kernel image */ 

movl $R(_end), %ecx 

addl $NBPG-1, %ecx 

andl $~(NBPG-1), %ecx 

movl %ecx, %esi 

/* clear bss and memory for bootstrap page tables. */ 
movl $R(_edata), %edi 

subl %edi, %ecx 

addl $(SPAGES+1+1+1)*NBPG, %ecx 

# stack + page directory + kernel page table + stack page table 

xorl %eax, %eax # pattern 

cld 

rep 

stosb 


/* Map Kernel—N.B. don't bother with making kernel text RO, as 386 

* ignores R/W AND U/S bits on kernel access (only valid bit works) ! 

* First step - build page tables */ 

movl %esi, %ecx # this much memory, 

shrl $PGSHIFT, %ecx # for this many ptes 

movl $PG_V, %eax # having these bits set, 

leal (2+SPAGES)*NBPG(%esi), %ebx # physical address of Sysmap 

movl %ebx, R(_KPTphys) # in the kernel page table, 

call fillpt 

/* map proc 0's kernel stack into user page table page */ 
movl $SPAGES, %ecx # for this many ptes, 

leal 1*NBPG(%esi), %eax # physical address of stack in proc 0 

orl $PG_V|PG_URKW, %eax # having these bits set, 

leal (1+SPAGES)*NBPG(%esi), %ebx # physical address of stack pt 
addl $(ptei(_PTmap)-1)*4, %ebx 

call fillpt 

/* Construct an initial page table directory */ 

/* install a pde for temporary double map of bottom of VA */ 
leal (SPAGES+2)*NBPG(%esi), %eax # physical address of kernel pt 
orl $PG_V, %eax 
movl %eax, (%esi) 

/* kernel pde - same contents */ 

leal pdei(KERNEL_BASE)*4(%esi), %ebx # offset of pde for kernel 
movl %eax, (%ebx) 





/* install a pde recursively mapping page directory as a page table! */ 
movl %esi, %eax # phys address of ptd in proc 0 

orl $PG_V, %eax 

movl %eax, pdei(_PTD)*4(%esi) 

/* install a pde to map stack for proc 0 */ 

leal (SPAGES+1)*NBPG(%esi), %eax # physical address of pt in proc 0 
orl $PG_V, %eax 

movl %eax, (pdei(_PTD)-1)*4(%esi) # which is where per-process maps! 

/* load base of page directory, and enable mapping */ 
movl %esi, %eax # phys address of ptd in proc 0 

orl $I386_CR3PAT, %eax 

movl %eax, %cr3 # load ptd addr into mmu 

movl %cr0, %eax # get control word

orl $0x80000001, %eax # and let's page!
movl %eax, %cr0 # NOW! 

/* now running mapped */ 

pushl $begin # jump to high mem! 

ret 

/* now running relocated at SYSTEM where the system is linked to run */ 
begin: 

/* set up bootstrap stack */ 

movl $_PTD-SPAGES*NBPG, %esp # kernel stack virtual address top 
xorl %eax, %eax # mark end of frames with a sentinel

movl %eax, %ebp 

movl %eax, _PTD # clear lower address space mapping 

leal (SPAGES+3)*NBPG(%esi), %esi # skip past stack + page tables. 

pushl %esi 

/* init386(startphys) main(startphys) */ 

call _init386 # wire 386 chip for unix operation 

call _main 

popl %eax 

/* find process (proc 0) to be run */ 
movl _curproc, %eax 

movl P_PCB(%eax), %eax 

/* build outer stack frame */ 
pushl PCB_SS(%eax) # user ss 

pushl PCB_ESP(%eax) # user esp 

pushl PCB_CS(%eax) # user cs 

pushl PCB_EIP(%eax) # user pc 

movw PCB_DS(%eax), %ds 

movw PCB_ES(%eax), %es 

lret # goto user! 

/* fill in pte/pde tables */ 
fillpt: 

movl %eax, (%ebx) /* stuff pte */ 

addl $NBPG, %eax /* increment physical address */ 

addl $4, %ebx /* next pte */ 

loop fillpt 

ret 
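
For readers more comfortable in C than in 386 assembler, the fillpt routine above amounts to the loop below. This is an illustrative rendition only; fill_ptes is a hypothetical name, and NBPG is assumed to be the 4-Kbyte page size. One difference is worth noting: the assembler version uses the loop instruction, which decrements the count before testing it and so expects a nonzero count, while this C version simply does nothing when the count is zero.

```c
#include <assert.h>
#include <stdint.h>

#define NBPG 4096u   /* bytes per page */

/* Store 'count' consecutive entries starting at 'pte'. Each entry is the
 * previous one plus one page, so the attribute bits in the low 12 bits
 * carry over unchanged. Returns a pointer to the next unused entry. */
uint32_t *fill_ptes(uint32_t *pte, uint32_t first_entry, unsigned count)
{
    assert(pte != 0);
    while (count-- > 0) {
        *pte++ = first_entry;    /* stuff pte */
        first_entry += NBPG;     /* increment physical address */
    }
    return pte;
}
```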





[LISTING TWO] 


/* machdep.c: Copyright (c) 1989,1991 William Jolitz. All rights reserved. 

* Written by William Jolitz 7/89 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

*/ 

/* [excerpted from i386/i386/machdep.c] */ 

/* Initialize segments & interrupt table */ 

#define GNULL_SEL  0 /* Null Descriptor */
#define GCODE_SEL  1 /* Kernel Code Descriptor */
#define GDATA_SEL  2 /* Kernel Data Descriptor */
#define GLDT_SEL   3 /* LDT - eventually one per process */
#define GTGATE_SEL 4 /* Process task switch gate */
#define GPANIC_SEL 5 /* Task state to consider panic from */
#define GPROC0_SEL 6 /* Task state process slot zero and up */
#define NGDT GPROC0_SEL+1

union descriptor gdt[GPROC0_SEL+1];

/* interrupt descriptor table */ 
struct gate_descriptor idt[32+16]; 

/* local descriptor table */ 
union descriptor ldt[5]; 

#define LSYS5CALLS_SEL  0 /* forced by intel BCS */
#define LSYS5SIGR_SEL   1
#define L43BSDCALLS_SEL 2 /* notyet */
#define LUCODE_SEL      3
#define LUDATA_SEL      4
/* #define LPOSIXCALLS_SEL 5 */ /* notyet */

struct i386tss tss, panic_tss;

/* software prototypes - in more palatable form */
struct soft_segment_descriptor gdt_segs[] = { 

/* Null Descriptor */ 

{ 0x0, /* segment base address */ 

0x0, /* length - all address space */ 

0, /* segment type */ 

0, /* segment descriptor priority level */ 

0, /* segment descriptor present */ 

0,0, 

0, /* default 32 vs 16 bit size */ 

0 /* limit granularity (byte/page units)*/ }, 

/* Code Descriptor for kernel */
{ 0x0,              /* segment base address */
  0xfffff,          /* length - all address space */
  SDT_MEMERA,       /* segment type */
  0,                /* segment descriptor priority level */
  1,                /* segment descriptor present */
  0,0,
  1,                /* default 32 vs 16 bit size */
  1                 /* limit granularity (byte/page units)*/ },

/* Data Descriptor for kernel */
{ 0x0,              /* segment base address */
  0xfffff,          /* length - all address space */
  SDT_MEMRWA,       /* segment type */
  0,                /* segment descriptor priority level */
  1,                /* segment descriptor present */
  0,0,
  1,                /* default 32 vs 16 bit size */
  1                 /* limit granularity (byte/page units)*/ },

/* LDT Descriptor */
{ (int) ldt,        /* segment base address */
  sizeof(ldt)-1,    /* length - all address space */
  SDT_SYSLDT,       /* segment type */
  0,                /* segment descriptor priority level */
  1,                /* segment descriptor present */
  0,0,
  0,                /* unused - default 32 vs 16 bit size */
  0                 /* limit granularity (byte/page units)*/ },

/* Null Descriptor - Placeholder */
{ 0x0,              /* segment base address */
  0x0,              /* length - all address space */
  0,                /* segment type */
  0,                /* segment descriptor priority level */
  0,                /* segment descriptor present */
  0,0,
  0,                /* default 32 vs 16 bit size */
  0                 /* limit granularity (byte/page units)*/ },

/* Panic Tss Descriptor */
{ (int) &panic_tss, /* segment base address */
  sizeof(tss)-1,    /* length - all address space */
  SDT_SYS386TSS,    /* segment type */
  0,                /* segment descriptor priority level */
  1,                /* segment descriptor present */
  0,0,
  0,                /* unused - default 32 vs 16 bit size */
  0                 /* limit granularity (byte/page units)*/ },

/* Proc 0 Tss Descriptor */
{ 0,                /* segment base address */
  sizeof(tss)-1,    /* length - all address space */
  SDT_SYS386TSS,    /* segment type */
  0,                /* segment descriptor priority level */
  1,                /* segment descriptor present */
  0,0,
  0,                /* unused - default 32 vs 16 bit size */
  0                 /* limit granularity (byte/page units)*/ }};

struct soft_segment_descriptor ldt_segs[] = {

/* Null Descriptor - overwritten by call gate */
{ 0x0,              /* segment base address */
  0x0,              /* length - all address space */
  0,                /* segment type */
  0,                /* segment descriptor priority level */
  0,                /* segment descriptor present */
  0,0,
  0,                /* default 32 vs 16 bit size */
  0                 /* limit granularity (byte/page units)*/ },





/* Null Descriptor - overwritten by call gate */ 

{ 0x0, /* segment base address */ 

0x0, /* length - all address space */ 

0, /* segment type */ 

0, /* segment descriptor priority level */ 

0, /* segment descriptor present */ 

0,0, 

0, /* default 32 vs 16 bit size */ 

0 /* limit granularity (byte/page units)*/ }, 

/* Null Descriptor - overwritten by call gate */ 

{ 0x0, /* segment base address */ 

0x0, /* length - all address space */ 

0, /* segment type */ 

0, /* segment descriptor priority level */ 

0, /* segment descriptor present */ 

0,0, 

0, /* default 32 vs 16 bit size */ 

0 /* limit granularity (byte/page units)*/ }, 

/* Code Descriptor for user */ 

{ 0x0, /* segment base address */ 

0xfffff, /* length - all address space */

SDT_MEMERA, /* segment type */ 

SEL_UPL, /* segment descriptor priority level */ 

1, /* segment descriptor present */ 

0,0, 

1, /* default 32 vs 16 bit size */ 

1 /* limit granularity (byte/page units)*/ }, 

/* Data Descriptor for user */ 

{ 0x0, /* segment base address */ 

0xfffff, /* length - all address space */

SDT_MEMRWA, /* segment type */ 

SEL_UPL, /* segment descriptor priority level */ 

1, /* segment descriptor present */ 

0,0, 

1, /* default 32 vs 16 bit size */ 

1 /* limit granularity (byte/page units)*/ } }; 

/* table descriptors - used to load tables by microp */ 
struct region_descriptor r_gdt = { 
sizeof(gdt)-1,(char *)gdt 

}; 

struct region_descriptor r_idt = { 
sizeof(idt)-1,(char *)idt 

}; 

setidt(idx, func, typ, dpl) char *func; {
        struct gate_descriptor *ip = idt + idx;

        ip->gd_looffset = (int)func;
        ip->gd_selector = GSEL(GCODE_SEL,SEL_KPL);
        ip->gd_stkcpy = 0;
        ip->gd_xx = 0;
        ip->gd_type = typ;
        ip->gd_dpl = dpl;
        ip->gd_p = 1;
        ip->gd_hioffset = ((int)func)>>16;
}

#define IDTVEC(name) X/**/name 

extern IDTVEC(div), IDTVEC(dbg), IDTVEC(nmi), IDTVEC(bpt), IDTVEC(ofl),
        IDTVEC(bnd), IDTVEC(ill), IDTVEC(dna), IDTVEC(dble), IDTVEC(fpusegm),
        IDTVEC(tss), IDTVEC(missing), IDTVEC(stk), IDTVEC(prot),
        IDTVEC(page), IDTVEC(rsvd), IDTVEC(fpu), IDTVEC(rsvd0),
        IDTVEC(rsvd1), IDTVEC(rsvd2), IDTVEC(rsvd3), IDTVEC(rsvd4),
        IDTVEC(rsvd5), IDTVEC(rsvd6), IDTVEC(rsvd7), IDTVEC(rsvd8),
        IDTVEC(rsvd9), IDTVEC(rsvd10), IDTVEC(rsvd11), IDTVEC(rsvd12),
        IDTVEC(rsvd13), IDTVEC(rsvd14), IDTVEC(rsvd14), IDTVEC(syscall);
int lcr0(), lcr3(), rcr0(), rcr2();
int _udatasel, _ucodesel, _gsel_tss;

init386() { extern ssdtosd(), lgdt(), lidt(), lldt(), etext;
        int x;
        unsigned biosbasemem, biosextmem;
        struct gate_descriptor *gdp;
        extern int sigcode,szsigcode;
        struct pcb *pb = proc0.p_addr;

        /* initialize console */
        cninit();

        /* make gdt memory segments */
        gdt_segs[GCODE_SEL].ssd_limit = btoc((int) &etext + NBPG);
        gdt_segs[GPROC0_SEL].ssd_base = pb;
        for (x=0; x < NGDT; x++) ssdtosd(gdt_segs+x, gdt+x);

/* make ldt memory segments */ 

ldt_segs[LUCODE_SEL].ssd_limit = btoc(UPT_MIN_ADDRESS); 
ldt_segs[LUDATA_SEL].ssd_limit = btoc(UPT_MIN_ADDRESS); 

/* Note, eventually want private ldts per process */ 
for (x=0; x < 5; x++) ssdtosd(ldt_segs+x, ldt+x) ; 

/* exceptions */ 

setidt(0, &IDTVEC(div), SDT_SYS386TGT, SEL_KPL); 
setidt(1, &IDTVEC(dbg), SDT_SYS386TGT, SEL_KPL); 
setidt(2, &IDTVEC(nmi), SDT_SYS386TGT, SEL_KPL); 
setidt(3, &IDTVEC(bpt), SDT_SYS386TGT, SEL_UPL); 
setidt(4, &IDTVEC(ofl), SDT_SYS386TGT, SEL_KPL); 
setidt(5, &IDTVEC(bnd), SDT_SYS386TGT, SEL_KPL); 
setidt(6, &IDTVEC(ill), SDT_SYS386TGT, SEL_KPL); 
setidt(7, &IDTVEC(dna), SDT_SYS386TGT, SEL_KPL); 
setidt(8, &IDTVEC(dble), SDT_SYS386TGT, SEL_KPL); 
setidt(9, &IDTVEC(fpusegm), SDT_SYS386TGT, SEL_KPL); 
setidt(10, &IDTVEC(tss), SDT_SYS386TGT, SEL_KPL); 
setidt(11, &IDTVEC(missing), SDT_SYS386TGT, SEL_KPL); 
setidt(12, &IDTVEC(stk), SDT_SYS386TGT, SEL_KPL); 
setidt(13, & IDTVEC(prot), SDT_SYS386TGT, SEL_KPL); 
setidt(14, & IDTVEC(page), SDT_SYS386TGT, SEL_KPL); 
setidt(15, &IDTVEC(rsvd) , SDT_SYS386TGT, SEL_KPL); 
setidt(16, &IDTVEC(fpu) , SDT_SYS386TGT, SEL_KPL); 
setidt(17, & IDTVEC(rsvdO), SDT_SYS386TGT, SEL_KPL); 
setidt(18, & IDTVEC(rsvdl), SDT_SYS386TGT, SEL_KPL); 
setidt(19, & IDTVEC(rsvd2), SDT_SYS386TGT, SEL_KPL); 
setidt(20, & IDTVEC(rsvd3), SDT_SYS386TGT, SEL_KPL); 
setidt(21, & IDTVEC(rsvd4), SDT_SYS386TGT, SEL_KPL); 
setidt(22, & IDTVEC(rsvd5), SDT_SYS386TGT, SEL_KPL); 
setidt(23, & IDTVEC(rsvd6), SDT_SYS386TGT, SEL_KPL); 
setidt(24, & IDTVEC(rsvd7), SDT_SYS386TGT, SEL_KPL); 
setidt(25, & IDTVEC(rsvd8), SDT_SYS386TGT, SEL_KPL); 
setidt(26, & IDTVEC(rsvd9), SDT_SYS386TGT, SEL_KPL); 
setidt(27, & IDTVEC(rsvdlO), SDT_SYS386TGT, SEL_KPL); 
setidt(28, & IDTVEC(rsvdl1), SDT_SYS386TGT, SEL_KPL); 
setidt(29, &IDTVEC(rsvdl2), SDT_SYS386TGT, SEL_KPL); 
setidt(30, &IDTVEC(rsvdl3), SDT_SYS386TGT, SEL_KPL); 


117 



setidt(31, &IDTVEC(rsvdl4), SDT_SYS386TGT, SEL_KPL); 

#include "isa.h" 

#if NISA >0 

isa_defaultirq(); 

#endif 

/* load descriptor tables into 386 */ 
lgdt(gdt, sizeof(gdt)-1); 
lidt(idt, sizeof(idt)-1) ; 
lldt(GSEL(GLDT_SEL, SEL_KPL)); 

/* resolve amount of memory present so we can scale kernel PT */ 
maxmem = probemem(); 

biosbasemem = rtcin(RTC_BASELO) + (rtcin(RTC_BASEHI)<<8);
biosextmem = rtcin(RTC_EXTLO) + (rtcin(RTC_EXTHI)<<8);
if (biosbasemem == 0xffff || biosextmem == 0xffff) {
        if (biosbasemem == 0xffff && maxmem > RAM_END)
                maxmem = IOM_BEGIN;

        if (biosextmem == 0xffff && maxmem > RAM_END)
                maxmem = IOM_BEGIN;

} else if (biosextmem > 0 && biosbasemem == IOM_BEGIN/1024) { 

int totbios = (biosbasemem + 0x60000 + biosextmem); 
if (totbios < maxmem) maxmem = totbios; 

} else maxmem = IOM_BEGIN; 

/* call pmap initialization to make new kernel address space */ 
pmap_bootstrap (); 

/* now running on new page tables, configured,and u/iom is accessible */ 
/* make a initial tss so microp can get interrupt stack on syscall! */ 
pb->pcbtss.tss_esp0 = UPT_MIN_ADDRESS; 
pb->pcbtss.tss_ss0 = GSEL(GDATA_SEL, SEL_KPL) ; 

_gsel_tss = GSEL(GPROC0_SEL, SEL_KPL); 
ltr(_gsel_tss); 

/* make a call gate to reenter kernel with */ 
gdp = &ldt[LSYS5CALLS_SEL].gd; 
gdp->gd_looffset = (int) &IDTVEC(syscall); 
gdp->gd_selector = GSEL(GCODE_SEL,SEL_KPL); 
gdp->gd_stkcpy = 0; 
gdp->gd_type = SDT_SYS386CGT; 
gdp->gd_dpl = SEL_UPL; 
gdp->gd_p = 1; 

gdp->gd_hioffset = ((int) &IDTVEC(syscall)) >>16; 

/* transfer to user mode */ 

_ucodesel = LSEL(LUCODE_SEL, SEL_UPL); 

_udatasel = LSEL(LUDATA_SEL, SEL_UPL); 

/* setup per-process */ 

bcopy(&sigcode, pb->pcb_sigc, szsigcode); 
pb->pcb_flags = 0; 
pb->pcb_ptd = IdlePTD; 


[LISTING THREE] 


/* Machine dependent constants for 386. */ 


/* user map constants */
#define VM_MIN_ADDRESS          ((vm_offset_t)0)
#define UPT_MIN_ADDRESS         ((vm_offset_t)0xFDC00000)
#define UPT_MAX_ADDRESS         ((vm_offset_t)0xFDFF7000)
#define VM_MAX_ADDRESS          UPT_MAX_ADDRESS

/* kernel map constants */
#define VM_MIN_KERNEL_ADDRESS   ((vm_offset_t)0xFDFF7000)
#define KPT_MIN_ADDRESS         ((vm_offset_t)0xFDFF8000)
#define KPT_MAX_ADDRESS         ((vm_offset_t)0xFDFFF000)
#define KERNEL_BASE             0xFE000000
#define VM_MAX_KERNEL_ADDRESS   ((vm_offset_t)0xFF7FF000)

/* # of kernel PT pages (initial only, can grow dynamically) */
#define VM_KERNEL_PT_PAGES      ((vm_size_t)1)

[LISTING FOUR] 

/* 

* pmap.h: Copyright (c) 1990,1991 William Jolitz. All rights reserved. 

* Written by William Jolitz 12/90 

* 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

* 

*/ 

/* 

* [excerpted from i386/pmap.h] 

* Recursive map version by W. Jolitz 
*/ 

/* page directory element */
struct pde
{
unsigned int
        pd_v:1,         /* valid bit */
        pd_prot:2,      /* access control */
        pd_mbz1:2,      /* reserved, must be zero */
        pd_u:1,         /* hardware maintained 'used' bit */
        :1,             /* not used */
        pd_mbz2:2,      /* reserved, must be zero */
        :3,             /* reserved for software */
        pd_pfnum:20;    /* physical page frame number of pte's*/
};

#define PD_MASK     0xffc00000  /* page directory address bits */
#define PT_MASK     0x003ff000  /* page table address bits */
#define PD_SHIFT    22          /* page directory address shift */
#define PG_SHIFT    12          /* page table address shift */

/* page table element */
struct pte
{
unsigned int
        pg_v:1,         /* valid bit */
        pg_prot:2,      /* access control */
        pg_mbz1:2,      /* reserved, must be zero */
        pg_u:1,         /* hardware maintained 'used' bit */
        pg_m:1,         /* hardware maintained modified bit */
        pg_mbz2:2,      /* reserved, must be zero */
        pg_w:1,         /* software, wired down page */
        :1,             /* software (unused) */
        pg_nc:1,        /* 'uncacheable page' bit */
        pg_pfnum:20;    /* physical page frame number */
};

#define PG_V        0x00000001
#define PG_RO       0x00000000
#define PG_RW       0x00000002
#define PG_u        0x00000004
#define PG_PROT     0x00000006  /* all protection bits . */
#define PG_W        0x00000200
#define PG_N        0x00000800  /* Non-cacheable */
#define PG_M        0x00000040
#define PG_U        0x00000020
#define PG_FRAME    0xfffff000

#define PG_NOACC    0
#define PG_KR       0x00000000
#define PG_KW       0x00000002
#define PG_URKR     0x00000004
#define PG_URKW     0x00000004
#define PG_UW       0x00000006

/*
 * Page Protection Exception bits
 */
#define PGEX_P      0x01    /* Protection violation vs. not present */
#define PGEX_W      0x02    /* during a Write cycle */
#define PGEX_U      0x04    /* access from User mode (UPL) */

/* 

* Address of current address space page table maps 

* and directories. 

*/ 

extern struct pte PTmap[], Sysmap[]; 
extern struct pde PTD[], PTDpde; 

/* 

* virtual address to page table entry and to physical address. 

* Note: these work recursively, thus vtopte of a pte will give 

* the corresponding pde that it in turn maps into. 

*/ 

#define vtopte(va)  (PTmap + i386_btop(va))
#define ptetov(pt)  (i386_ptob(pt - PTmap))
#define vtophys(va) (i386_ptob(vtopte(va)->pg_pfnum) | ((int)(va) & PGOFSET))
#define ispt(va)    ((va) >= UPT_MIN_ADDRESS && (va) <= KPT_MAX_ADDRESS)

/*
 * macros to generate page directory/table indices
 */
#define pdei(va)    (((va)&PD_MASK)>>PD_SHIFT)
#define ptei(va)    (((va)&PT_MASK)>>PG_SHIFT)
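
The recursive property noted in the comment of Listing Four can be checked with plain address arithmetic. The sketch below is not part of the original listing. It assumes that PTmap begins at UPT_MIN_ADDRESS (0xFDC00000 in Listing Three) and that each pte occupies 4 bytes; vtopte_addr and check_recursive_map are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PD_MASK    0xffc00000u  /* page directory address bits */
#define PT_MASK    0x003ff000u  /* page table address bits */
#define PD_SHIFT   22
#define PG_SHIFT   12
#define PTMAP_BASE 0xFDC00000u  /* assumption: PTmap starts at UPT_MIN_ADDRESS */
#define PTE_SIZE   4u           /* assumption: one pte is 4 bytes on the 386 */

/* index of the page directory entry covering va */
uint32_t pdei(uint32_t va) { return (va & PD_MASK) >> PD_SHIFT; }

/* index of the page table entry within its page table */
uint32_t ptei(uint32_t va) { return (va & PT_MASK) >> PG_SHIFT; }

/* virtual address of the pte that maps va, i.e. PTmap + i386_btop(va) */
uint32_t vtopte_addr(uint32_t va)
{
    return PTMAP_BASE + (va >> PG_SHIFT) * PTE_SIZE;
}

/* the pte that maps the base of PTmap lies in the page directory page */
void check_recursive_map(void)
{
    assert(vtopte_addr(PTMAP_BASE) == 0xFDFF7000u);
}
```

Under these assumptions, the pte for a kernel address such as 0xFE001234 lands at 0xFDFF8004, inside the kernel page table region that starts at KPT_MIN_ADDRESS. Applying the same function to the base of PTmap gives 0xFDFF7000, the value Listing Three assigns to UPT_MAX_ADDRESS, which is where the page directory sits.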





[LISTING FIVE] 


/* param.h: Copyright (c) 1989,1990,1991 William Jolitz. All rights reserved. 

* Written by William Jolitz 6/89 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS'' AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 

*/ 

/* Machine dependent constants for Intel 386. */ 

#define MACHINE "i386"

#define NBPG        4096        /* bytes/page */
#define PGOFSET     (NBPG-1)    /* byte offset into page */
#define PGSHIFT     12          /* LOG2(NBPG) */
#define NPTEPG      (NBPG/(sizeof (struct pte)))

#define NBPDR       (1024*NBPG) /* bytes/page dir */
#define PDROFSET    (NBPDR-1)   /* byte offset into page dir */
#define PDRSHIFT    22          /* LOG2(NBPDR) */

#define KERNBASE    0xFE000000  /* start of kernel virtual */

#define DEV_BSIZE   512
#define DEV_BSHIFT  9           /* log2(DEV_BSIZE) */

#define CLSIZE      1
#define CLSIZELOG2  0

#define SSIZE       1           /* initial stack size/NBPG */
#define SINCR       1           /* increment of stack/NBPG */
#define SPAGES      2           /* pages of kernel stack area */

/* clicks to bytes */
#define ctob(x)     ((x)<<PGSHIFT)

#define btodb(bytes)            /* calculates (bytes / DEV_BSIZE) */ \
        ((unsigned)(bytes) >> DEV_BSHIFT)
#define dbtob(db)               /* calculates (db * DEV_BSIZE) */ \
        ((unsigned)(db) << DEV_BSHIFT)

/* Map a "block device block" to a file system block. This should be device
 * dependent, and will be if we add an entry to cdevsw/bdevsw for that purpose.
 * For now though just use DEV_BSIZE. */
#define bdbtofsb(bn)    ((bn) / (BLKDEV_IOSIZE/DEV_BSIZE))

/* Mach derived conversion macros */
#define i386_round_pdr(x)   ((((unsigned)(x)) + NBPDR - 1) & ~(NBPDR-1))
#define i386_trunc_pdr(x)   ((unsigned)(x) & ~(NBPDR-1))
#define i386_round_page(x)  ((((unsigned)(x)) + NBPG - 1) & ~(NBPG-1))
#define i386_trunc_page(x)  ((unsigned)(x) & ~(NBPG-1))
#define i386_btod(x)        ((unsigned)(x) >> PDRSHIFT)
#define i386_dtob(x)        ((unsigned)(x) << PDRSHIFT)
#define i386_btop(x)        ((unsigned)(x) >> PGSHIFT)
#define i386_ptob(x)        ((unsigned)(x) << PGSHIFT)





[FIGURE 2]

Figure 2: 386BSD Process Virtual Space

[FIGURE 4(a)]

[FIGURE 3]

Figure 3: 386BSD Virtual Address Space Page Table Locations

    0xFE000000    Bottom of O/S Kernel
    0xFDFF8000    Kernel Page Table (Sysmap)
    0xFDFF7000    Page Table Directory
                  User Page Tables (high)
                  Empty User Page Tables
    0xFDC00000    User Page Tables (low)
                  Top of User Process

    Regions labeled in the figure: Kernel Global Address Space;
    User Process Private Address Space

[FIGURE 4(b)]













Porting Unix To The 386: The Basic Kernel 

Initialization of the 386BSD kernel services and data structures 

Overview and initialization 

This article contains the following executables: 386BSD.891 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, the first virtual memory micro-processor-based UNIX system. Prior to 
establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric 
Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or 
comments to lynne@berkeley.edu. Copyright (c) 1991 TeleMuse. 

In the previous article we examined the machine-dependent layer initialization of the "stripped-down" 
kernel — the machine-dependent portion of the kernel which installs the kernel into the position to execute 
processes (via the bootstrap procedure) and prepares the system for initialization of the minimum 
machine-independent portions of the kernel (processes, files, and pertinent tables). We viewed our 
386BSD kernel as a kind of "virtual machine" (not to be confused with the "virtual" in "virtual memory"), 
where functions underlie other functions transparently. When initialized, the system can use portions that 
require little direction to initialize even larger portions. Thus, this virtual machine assembles itself tool by 
tool, much like a set of Russian dolls. The machine-dependent kernel initialization is the innermost of the 
dolls — the kernel of the kernel around which all is built. 

We now extend the layered model further, by incrementally turning on all its internal services using the
kernel’s main() procedure. In other words, this next outer layer will be built by the kernel’s main()
procedure, which in turn initializes higher-level portions of the kernel. This is the second major milestone
of our UNIX port: the halcyon point where most of the kernel services and data structures are initialized.

At this stage, we’ll review key elements of the BSD kernel which will be invoked in future articles. We’ll 
briefly examine the interrelationships between some of these elements, in order to delineate the broader 
picture a bit more and illuminate some important ideas in UNIX system design. 




Layered Modeling: Achieving a Well-Stacked System 

A basic understanding of the entire system demands a return to the layered model described last month. In 
brief, the kernel is a program which runs in supervisor mode on the 386 (or "Ring 0" — for a review on 
rings, see "The Standalone System" DDJ March 1991). The kernel implements the primitives, called 
"system calls," of UNIX and manages the environment and other characteristics of the many user 
"processes" run to provide functionality to the system. (Each user process runs in a separate "Ring 3" 
address space.) Only processes running in their protected address spaces are truly visible to the UNIX 
user, as they provide the requested functionality (a command processor or "shell," a compiler, an editor, 
and so on). These processes constitute the outermost layer of our layered operating system model. System
calls and various processor exceptions (page fault, device interrupt, overflow, and so on) are methods by
which a process either directly or indirectly enters the UNIX kernel to request services. In this way, the
kernel acts as a transparent (not statically-linked) subroutine library that functions as a kind of virtual
machine. It’s as if the microprocessor hardware itself actually did a whole read() system call request in
the single lcall instruction used to signal a system call. (For further information, see Leffler, et al., The
Design and Implementation of the 4.3BSD UNIX Operating System, Chapter 3: Kernel Services, pages
43-45, Addison-Wesley, 1989.)

Layers within the kernel are split into the (mostly) machine-independent "top" half, and the (mostly) 
machine-dependent "bottom" half. The top half synchronously processes exceptions and system calls and 
blocks a currently executing process when an event causes a delay, such as a temporary memory shortage, 
or a device input/output operation. Blocking a process permits the kernel to run another delayed process, 
allowing multiple processes to appear to run concurrently on a single processor. The bottom half, in 
contrast, asynchronously processes device interrupts that are never allowed to be blocked. Device 
interrupts can be viewed, therefore, as high-priority, real-time tasks brought to life by a hardware interrupt 
to render the necessary effect and then exit the stage. They then return the kernel from the interrupt back 
to whatever code was running before the interrupt occurred. In a way, device interrupts are practically 
stateless and serve primarily to signal the occurrence of an external event to the synchronous "top" layers. 
Note, however, that such notifications will only take effect when the top layers allow preemption — in the 
UNIX model, this is only allowed at certain points when operating in the "top" layers (generally, when 
returning to the user process from the kernel, or when blocking for an event). 

The impact of the layered model on our 386BSD port cannot be overstated. Our 386BSD system can be
broken up into modular subsystems that have neat boundaries and work by a handful of simple rules. One 
rule which we live by is the aforementioned low-level asynchronous, top-level synchronous arrangement. 
For example, if we describe some code that must block for a resource, we are already restricted to a 
discussion of the top layer. Likewise, if we describe an event that occurs as a result of a peripheral 
completing an operation, rest assured it resides within the lower layers. This organization allows us to 
break the whole into parts we can handle; otherwise we quickly become mired in complexity. 

Understanding and following the rules inherent in our layered model greatly simplifies UNIX kernel 
design. Without these rules, we would need a lot more "critical region" code dealing with arbitrary 
preemption. These same rules, however, also limit our ability to easily implement UNIX in real-time 
environments. (For example, Ethernet delays are sometimes unpredictable.) Some versions of UNIX 
attempt to improve real-time performance by minimizing worst-case delays through the judicious addition 
of blocks to allow high-priority processes to run, but this is not a simple fix. 

There is reason to believe that the synchronous design of UNIX limits its performance with disk file 
writes. In a paper presented at the Summer 1990 USENIX Technical Conference, "Why Aren’t Operating 
Systems Getting Faster as Fast as Hardware?" (USENIX Technical Proceedings, page 247) John Ousterhout
of the University of California at Berkeley discusses the need to "decouple filesystem...operations that 
modify files," as synchronous writes are required for filesystem stability, consistency, and crash recovery. 
This is currently not a great problem, because filesystem reads (the majority of operations) can be 
elegantly cached and anticipated. Still, this raises questions about the current UNIX model and may result 
in its revision. 





Top-Level Layers 

Last month, we discussed bottom-level initialization, where the Interrupt Descriptor Table (IDT), a 386 
hardware interface to interrupts and exceptions, was wired into code entry points (IDTVEC(XXX)). This
is how 386BSD glues the hardware interrupts onto the bottom layer. In addition, some of the top-layer
interfaces were also established. Now we need to build and initialize the other top-layer kernel functions in
our BSD kernel main(). With the kernel initialized, we must get the ball rolling by bootstrapping user
processes to add functionality in the form of services (such as a command processor that will allow useful 
work to be done with our 386BSD system). 

To implement the UNIX model, we refer to many items — all of which are managed by the top layers and 
referenced by lower layers. These include processes, address spaces, files, filesystems, buffers, messages, 
signals, credentials, and others. They are grouped into a global set managed by the kernel on behalf of all 
processes, and a private set managed by the process for the benefit of the program running within the 
process. 

The Global Kernel Set 

The global set of objects is split into a shared database (proc, inode, buffer cache, and file structures), as 
well as a group of consumable resources (memory pages, data buffers, and heap storage). The shared 
database objects provide methods by which items are searched, contended for, modified, allocated, deleted, and
linked together, all in a preemptible fashion, since many processes may attempt simultaneous access to
these objects during multitasking operation. These databases must be initialized, allocated the appropriate 
minimum requirements for operation, and linked as the system requires. 

The UNIX paradigm is "process-centric," that is, most of the activity is built around the current running 
process. With the exception of necessary functions such as scheduling or interprocess communications 
(which require knowledge of multiple processes), most of the kernel is written without any explicit 
knowledge of any process save the running one. Thus, an understanding of how a process is provided 
services tells you the bulk of what the kernel does. Very little of the code and data structures are explicitly 
aimed at this "global" view. 

The focus of kernel activity is the list of processes (the "proc table"). Processes are linked into various lists 
so they can locate other processes through various relationships. As the kernel operates, processes migrate 
onto and off of different lists. While the process structure is not globally allocated, the list of entries is 
itself a global resource. Each system call and exception is directed to operate on a given process. As such, 
the BSD kernel uses the struct proc entry of each process as the key data structure that indexes all related 
kernel entities of a process. 

Process Private Set 


Each process possesses a number of data structures which can be leveraged to properly implement the
UNIX model. So many of these are required that we reduce them, for simplicity’s sake, to a given set on
which we draw in our discussions. All these properties of the process are rooted in the "per-process
data structure," also known as struct proc or "proc slot" (see Listing One, page 126). This is just one
element of the previously mentioned list of processes which defines just what a process is.





The Proc Slot 


In Figure 1, 386BSD uses a proc slot as the nexus of information for a process. Many different structures,
most of them dynamically allocated, hang off this single proc entry (see Figure 2). These may be, in
different cases, shared by processes, dynamically grown, or externalized to special applications. Among
the auxiliary structures are:


p_cred This structure is the process’s credentials, that is, the information (such as user ID number and 
group membership) used to regulate access to system resources by the process. This information is 
managed in a generic fashion by most of the kernel and is consulted by a tiny, centralized portion of the 
kernel (so that additional security control mechanisms can be added or substituted). It is shared by sibling 
processes of like ownership. 


p_fd Each process has a private file descriptor table: a dynamically allocated, growable structure used to 
store information on files currently open by the process. (Older versions of UNIX had a static limit on the 
number of open files, usually 20.) 

p_stats Statistics on the use of various resources consumed by the process. For example, the amount of 
time the processor used, the memory used, and various other details are tallied by this structure. 


p_limits Analogous to statistic recording on the process, this auxiliary structure is used to put 
administrative limits on resource utilization. 


p_vmspace Another critical resource for the process is contained in the virtual address space, details of 
which can be found within each process’s p_vmspace data structure. Among the data available is the 
virtual memory system’s address map (vm_map) which heads a table of address map entries. Each entry, 
in turn, manages allocated regions of virtual address space. Also, each process contains a physical map 
(vm_pmap) structure, managed by the pmap layer, containing current address translation state information 
(discussed later in the "Virtual Memory Subsystem" section). 

p_sigacts POSIX process signals, a kind of software interrupt for user processes, are implemented with the 
signal action state information in this structure. 

p_pgrp POSIX provides for the concept of "sessions" as a method of organizing process groups. Process 
groups are a set of processes operating together (for example, a pipeline such as "foo | bar | bletch"). 
Sessions utilize a session leader (usually a command processor or shell) that manages process groups. It 
has the ability to suspend or resume process groups run connected to a terminal (in the "foreground") or 
detached from the terminal (in the "background"). The data structures used to manage this feature reside in 
this shared data structure. 
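
The hub-and-spoke arrangement just described can be pictured with a much-reduced sketch. It is not the struct proc of Listing One: the p_ member names follow the list above, but every auxiliary structure is cut down to a single placeholder field with a hypothetical name, and the reference count used to share credentials is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder auxiliary structures; all field names here are hypothetical. */
struct pcred    { int cr_uid; int cr_refcnt; };    /* credentials */
struct filedesc { int fd_nfiles; };                /* open file table */
struct pstats   { long ps_cputicks; };             /* resource usage */
struct plimit   { long pl_maxdata; };              /* administrative limits */
struct vmspace  { void *vm_map; void *vm_pmap; };  /* address space */
struct sigacts  { unsigned long sa_catch; };       /* signal actions */
struct pgrp     { int pg_id; };                    /* process group */

/* A reduced proc slot: the hub that points at everything else. */
struct proc_sketch {
    int              p_pid;
    struct pcred    *p_cred;
    struct filedesc *p_fd;
    struct pstats   *p_stats;
    struct plimit   *p_limits;
    struct vmspace  *p_vmspace;
    struct sigacts  *p_sigacts;
    struct pgrp     *p_pgrp;
};

/* Let a child share its parent's credentials instead of copying them. */
void share_cred(struct proc_sketch *parent, struct proc_sketch *child)
{
    assert(parent->p_cred != NULL);
    child->p_cred = parent->p_cred;
    child->p_cred->cr_refcnt++;   /* assumption: sharing is reference counted */
}
```

Because each facility hangs off its own pointer, one of them can be replaced or shared without touching the others, which is the modularity the following paragraphs discuss.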


The proc structure in 386BSD highlights the modularity of function present in the BSD design. Although 
BSD is currently implemented as a monolithic kernel, it can be arranged so that multithreaded distributed 
kernel operation can be achieved. In general, BSD kernel development has been focused around the 
revision and examination of the monolithic operating system kernel prior to implementation in a 
multithreaded kernel. This approach seeks to avoid putting the cart before the horse, so to speak, and 
avoids vacuous "modularity" modifications which purport to work only in a multiprocessor environment. 
This is not reticence in design — merely caution. 






Multiprocessor systems are desirable, so the pent-up enthusiasm to take advantage of them can overwhelm
the many research directions available and result in the canonization of inappropriate or short-sighted 
standards. Current standards efforts are making headway, although the overall multiprocessor architecture 
is still unknown. (For example, some POSIX groups are attempting to define a standard for thread 
programming, and currently the most popular standard is one contrary to UNIX primitives, because the 
group touting this standard would rather ignore UNIX. This will result in another pointless standard taking 
its place alongside the dodo and other dead-end events of history.) 

Due to the way this arrangement results in "data hiding," the facilities of filesystems, accounting, 
administration, virtual memory, and POSIX signal processing are each separated from the inner part of the 
operating system’s kernel. Each can be evolved separately or redefined with minimal interaction, as befits a
modular design. 

Kernel Events 

Processes operate synchronously, processing a system call item by item. If they need to wait for either a 
resource or an external event, they must block with a sleep() function call to await changes and give up 
the processor. Elsewhere in the kernel, a corresponding wakeup() function call will awaken the snoozing 
process, preparing the process to run when next possible. While sleep gives up the processor, wakeup
only schedules processes to run; it does not transfer control to them, nor does it even ensure that the
process will ever run. Wakeup calls are idempotent: many can occur before the process actually starts to run.

A process can only wait on a single event at a time, and that event is usually uniquely identified by the kernel
address of the object for which the process waits. This event is stored in the p_wchan field of the process’s proc slot.
Events themselves don’t require additional space when active, so secondary or recursive effects (as might 
happen in the case of a block on memory starvation) don’t occur. 
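
The sleep and wakeup behavior described above can be modeled in a few lines. This is a toy model under stated assumptions, not the 386BSD implementation: the names tproc, sleep_on, and wakeup_chan are hypothetical, the process table is a small fixed array, and no real context switch takes place.

```c
#include <assert.h>
#include <stddef.h>

enum pstate { P_RUN, P_SLEEP };

struct tproc {
    enum pstate p_stat;
    const void *p_wchan;   /* address of the object being waited on */
};

#define NPROC 4
struct tproc proctab[NPROC];   /* zero-initialized: all runnable */

/* Block a process on a single event, identified by a kernel address. */
void sleep_on(struct tproc *p, const void *chan)
{
    assert(chan != NULL);
    p->p_wchan = chan;
    p->p_stat = P_SLEEP;
    /* a real kernel would now give up the processor */
}

/* Make every process waiting on chan runnable; returns how many woke. */
int wakeup_chan(const void *chan)
{
    int i, n = 0;

    for (i = 0; i < NPROC; i++) {
        struct tproc *p = &proctab[i];
        if (p->p_stat == P_SLEEP && p->p_wchan == chan) {
            p->p_wchan = NULL;
            p->p_stat = P_RUN;   /* scheduled, but not yet running */
            n++;
        }
    }
    return n;
}
```

A second wakeup on the same address finds no sleepers and changes nothing, which is the idempotence mentioned above. The event itself needs no storage beyond the one p_wchan field in each waiting process.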

As the 386BSD system and its drivers are all written with these event mechanisms in place, we are 
potentially multitasking from the start, although until we replicate (or "fork") to create multiple processes, 
no actual context switches occur to different processes. (There are no other processes to switch to.) 

Instead, the processor is allowed to idle, waiting for events. In the UNIX perspective, we always try to 
organize the general case so that it functions seamlessly on initialization, in order to leverage it early. An 
example of this approach can be observed in the mechanisms that provide diskless operation, where we 
must provide a root filesystem over a network connection before we have a filesystem to run the programs 
that normally initialize the network and locate the filesystem on the network. (Got it? Good.) 

Machine-Independent Initialization 

Machine-independent initialization is begun by wiring up a "process zero." In the previous article, we took 
care in the assembly language initialization to craft a separate region for the kernel stack; this will be our
"0th" process kernel stack. We then commence the creation and attachment of the necessary auxiliary data
structures that process 0 will use during the lifetime of our system. No process is specially considered, so 
all must have these structures present and consistent with other structures in the kernel. To avoid recursive 
problems with the "virgin" birth, the first process must be hand-wired with the barest of necessities, and 
space for the auxiliary structures must be allocated statically. In fact, we will find that process 0 will 
attempt to become eternal, so it’s actually more costly to dynamically allocate space for it than to do so 
statically! 





Having made a 0th process, we now must create a process list to which the system can refer in a global
fashion, to locate, add, delete, and modify processes. This is not really complicated, because at this point
all of the queue pointers point either at our just-born process 0 or at "nil." At this stage, all process-related
operations can now be activated, although only for statically-allocated processes (which is not very 
interesting — we need to turn on the virtual memory and storage allocation functions for something more 
useful). 

UNIX likes to have access to herds of processes, many appearing to run simultaneously, to do its bidding.
As a result, we need to rapidly flit between processes running for a brief slice of time before blocking. To 
make this a low-cost operation, we use a priority-ordered run queue of process pointers to rapidly select 
the next process to run when it’s time to switch. This is now initialized to permit the context switch code 
to be run (as it will be called when we block for I/O operations). 
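
One simple way to keep selection cheap is a linked list held in priority order, so that the next process to run is always at the head. The sketch below illustrates that idea only; it does not reproduce the 386BSD run queue code. The names rproc, runq_insert, and runq_next are hypothetical, and the convention that a smaller p_pri value is more urgent is an assumption.

```c
#include <assert.h>
#include <stddef.h>

struct rproc {
    int           p_pri;    /* assumption: smaller value = more urgent */
    struct rproc *p_link;   /* next process on the run queue */
};

struct rproc *runq = NULL;  /* head of the priority-ordered queue */

/* Insert p, keeping the queue sorted by priority (FIFO among equals). */
void runq_insert(struct rproc *p)
{
    struct rproc **pp = &runq;

    assert(p != NULL);
    while (*pp != NULL && (*pp)->p_pri <= p->p_pri)
        pp = &(*pp)->p_link;
    p->p_link = *pp;
    *pp = p;
}

/* Remove and return the most urgent runnable process, or NULL to idle. */
struct rproc *runq_next(void)
{
    struct rproc *p = runq;

    if (p != NULL) {
        runq = p->p_link;
        p->p_link = NULL;
    }
    return p;
}
```

Insertion pays the cost of ordering once, so the context switch path only has to unlink the head. When the queue is empty the caller gets NULL, which corresponds to the processor idling until an event makes a process runnable.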

Virtual Memory Subsystem 

As mentioned in the previous article, 386BSD has been rewritten to use a new virtual memory system with 
greater capabilities. This new package, derived from MACH version 2, possesses generalized mechanisms 
which allow management of multiple regions of virtual memory within the user processes and kernel 
itself, thus avoiding the arbitrary and idiosyncratic methods used in earlier Berkeley UNIX virtual memory 
systems. This "new vm" is composed of machine-dependent (physical map) and machine-independent 
(virtual map) portions. 

This new virtual memory system was originally conceived in 1985 at Carnegie-Mellon University by
Avadis Tevanian (now at NeXT) and Michael Wayne Young to provide an easily retargetable virtual
memory system with the modern functionality required by the MACH operating system implementation. It
serves as the basis for virtual memory systems in current MACH implementations, OSF/1, and Berkeley 
UNIX. 

To initialize the virtual memory system, all remaining pages of physical memory (not occupied by the 
kernel program itself) are each first allocated a resident page data structure (vm_page). Queues of free 
pages are created so that pages can be allocated from them. 

Next, virtual memory objects are created to provide an abstraction on which to hang collections of 
physical pages. To allocate virtual address space, virtual memory maps are also created to identify valid 
regions of virtual memory and the characteristics of these regions. The virtual memory system will 
associate virtual memory objects containing physical pages of memory with portions of address space 
mapped by a virtual memory map, as needed. 

We then initialize the kernel’s virtual address map and provide a mechanism to allocate portions of "wired 
down" memory to the kernel’s address space with a function called kmem_alloc. This function, the most 
primitive of storage allocators, allows us to allocate pages of memory dynamically in the granularity of 
pages at a time. 
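
Because this allocator works in whole pages, any request is in effect rounded up to a multiple of the page size. The sketch below reuses three macros from Listing Five to show the arithmetic; kmem_pages_needed is a hypothetical helper written for this example, not a 386BSD function.

```c
#include <assert.h>

#define NBPG     4096   /* bytes/page */
#define PGSHIFT  12     /* LOG2(NBPG) */

#define i386_round_page(x) ((((unsigned)(x)) + NBPG - 1) & ~(NBPG-1))
#define i386_trunc_page(x) ((unsigned)(x) & ~(NBPG-1))
#define i386_btop(x)       ((unsigned)(x) >> PGSHIFT)

/* Hypothetical helper: pages a page-granular allocator must supply. */
unsigned kmem_pages_needed(unsigned nbytes)
{
    /* beyond this the rounding would wrap around 32 bits */
    assert(nbytes <= 0xFFFFF000u);
    return i386_btop(i386_round_page(nbytes));
}
```

A one-byte request therefore costs a full 4096-byte page, and a 4097-byte request costs two, which is why a finer-grained allocator is layered on top for small objects.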

With a memory allocator present, the initialization of the physical map (pmap) portion of the system is
completed, allocating tables that will be used by the physical map module to track the association of
physical pages of memory with the hardware address translation mechanism’s data structures (Page
Directory Table and Page Table Pages on the 386). At this point, the virtual memory system can allocate
multiple address spaces and, on a fault, assign physical pages to legitimate references to previously
mapped virtual map regions.


In designing a virtual memory system, the common drawback is the inherent complexity required. Not 
only does the system have to allocate virtual address space, but it also needs to allocate pages of memory 
to "back up" the virtual space. On some systems, the virtual memory system allocates space, grabs some 
pages, and manually wires them into the address translation map. With the new 386BSD virtual memory 
system, when you ask for memory from a memory allocator, both virtual and physical memory are 
allocated. In other words, you always get the memory in an address space. 

Another point to consider when designing virtual memory systems: Suppose we share the same pages in 
different processes. We may wish to "back up" shared pages that might be modified incrementally — thus, 
unique pages need be created only when the contents of a page change. This mechanism, called "copy on 
write," allows us to postpone or avoid entirely modifying a process’s memory. Only a mechanism to track 
changes is required. This is accomplished by copying virtual memory objects that shadow the original 
object. 
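
The shadowing idea can be sketched with a toy object that holds only the pages that differ from the object beneath it. This is an illustration under stated assumptions rather than the Mach-derived code: the names vobj, vobj_lookup, and vobj_write are hypothetical, pages are 16 bytes to keep the example small, and ordinary heap memory stands in for physical pages.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NPAGES   4
#define PAGESIZE 16   /* tiny pages keep the sketch readable */

struct vobj {
    struct vobj *shadowed;       /* object this one shadows, or NULL */
    char        *page[NPAGES];   /* NULL: page not present in this object */
};

/* Find page n by walking down the shadow chain. */
char *vobj_lookup(struct vobj *o, int n)
{
    assert(n >= 0 && n < NPAGES);
    for (; o != NULL; o = o->shadowed)
        if (o->page[n] != NULL)
            return o->page[n];
    return NULL;
}

/* Write one byte; copy the page into the top object first if needed. */
int vobj_write(struct vobj *o, int n, int off, char c)
{
    assert(n >= 0 && n < NPAGES && off >= 0 && off < PAGESIZE);
    if (o->page[n] == NULL) {
        char *src = vobj_lookup(o->shadowed, n);
        char *copy = calloc(1, PAGESIZE);

        if (copy == NULL)
            return -1;
        if (src != NULL)
            memcpy(copy, src, PAGESIZE);   /* the deferred copy */
        o->page[n] = copy;
    }
    o->page[n][off] = c;
    return 0;
}
```

Reads through the shadow object fall through to the original until the first write, at which point only the touched page is duplicated and the original stays unchanged for any other process sharing it.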

To complete the initialization of the virtual memory system, we must now initialize and activate "pagers," 
the software that reads in the contents of pages from the filesystem and stages pages in and out of 
processor memory to disk when we run short of "fast" storage. Pagers interface with external forms of 
information, such as local filesystems, disk swap partitions, disk drives, network filesystems, and the like. 

Kernel Memory Allocator 

Besides allocating pages of memory from the virtual memory system, we need a means of allocating 
smaller granularity objects. Many data structures, with both short and long lifetimes and generally on the
order of 32 bytes in size, are allocated by the kernel on an "as needed" basis. UNIX provides user
processes with a malloc() memory allocator for general-purpose memory allocation; the same type of 
function resides in the BSD kernel. This provides for a global heap store — so called because everything is 
kept in a heap, all piled together! 

Kernel malloc() uses the virtual memory system to obtain actual storage to manage (called an "arena"). 
This storage area encompasses the heap itself. After the vm system has been activated, we initialize our 
allocator. From this point on, we can dynamically allocate data structures. Older versions of BSD used 
statically allocated tables that minimally required the system to be patched and rebooted if a resource was 
overutilized — sometimes the system even had to be recompiled from its source code. With dynamic 
allocation, the configuration can be changed on a live system and the effect observed immediately. 
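
The text does not spell out how the kernel allocator divides its arena. One common approach for small objects, shown here purely as an illustration, is to round each request up to a power of two and keep a free list per size class. The constants MINBUCKET and MAXBUCKET and the function bucket_index are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MINBUCKET 4    /* assumption: smallest block is 1 << 4 = 16 bytes */
#define MAXBUCKET 16   /* assumption: largest block is 1 << 16 = 64 KB */

/*
 * Index of the smallest power-of-two size class that holds size bytes,
 * or -1 if the request is too large for the bucket scheme.
 */
int bucket_index(size_t size)
{
    int idx;

    assert(size > 0);   /* zero-size requests are a caller bug here */
    for (idx = MINBUCKET; idx <= MAXBUCKET; idx++)
        if (((size_t)1 << idx) >= size)
            return idx;
    return -1;
}
```

With these constants, a typical 32-byte structure falls in bucket 5, and a request one byte larger moves up to the 64-byte class in bucket 6.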

Device Startup 

Once enough of the system services are established, we can proceed to scale and configure tables 
appropriate for operation, among them the buffer cache and character list (clist) structures. While these are 
usually similar on most systems, a few have private buffer memory pools associated with devices (such as 
a disk array with onboard RAM) that should be specially arranged prior to system operation. Currently, 
the amount of disk buffering memory is chosen as a fixed percentage of memory at boot time, but work is 
underway to allow a more dynamic allocation scheme. 





Next, we configure() devices in the system by walking a table of devices — calling each device driver’s 
probe routine with the parameters for each device and testing for the presence of each recorded device. 
Not all devices need be present. In fact, alternative addresses may be recorded for the same device. If the 
device is present, a probe routine will return true, with a subsequent call to the corresponding attach() 
routine to allocate resources (memory, interrupts, and so on) for the device and wire it into 386BSD. (In 
future articles, we will discuss how 386BSD dynamically structures the interrupt control devices 
on-the-fly.) 
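The table walk described above can be sketched in a few lines. This is an illustration only: struct dev_entry, configure_devices(), and the stand-in probe and attach routines are hypothetical names invented for the example, not the 386BSD data structures, and the "hardware" is simulated by a probe routine that answers at a single address.

```c
#include <stddef.h>

/* Hypothetical configuration-table entry; not the 386BSD structure. */
struct dev_entry {
    const char *name;
    int ioaddr;                             /* recorded address to test */
    int (*probe)(struct dev_entry *);       /* nonzero if device answers */
    void (*attach)(struct dev_entry *);     /* claim resources, wire it in */
    int alive;                              /* set once attached */
};

/* Walk the table: probe each recorded device, attach those present. */
int configure_devices(struct dev_entry *tab, size_t n)
{
    size_t i;
    int found = 0;

    for (i = 0; i < n; i++) {
        if (tab[i].probe(&tab[i])) {
            tab[i].attach(&tab[i]);
            tab[i].alive = 1;
            found++;
        }
    }
    return found;
}

/* Stand-in driver routines: pretend hardware exists only at 0x3f8. */
static int probe_demo(struct dev_entry *d)
{
    return d->ioaddr == 0x3f8;
}

static void attach_demo(struct dev_entry *d)
{
    (void)d;    /* a real attach would allocate memory, interrupts, etc. */
}

/* The same device recorded at two alternative addresses, plus one more. */
struct dev_entry demo_table[] = {
    { "com", 0x2f8, probe_demo, attach_demo, 0 },
    { "com", 0x3f8, probe_demo, attach_demo, 0 },
    { "lpt", 0x378, probe_demo, attach_demo, 0 },
};
```

Calling configure_devices(demo_table, 3) attaches only the second entry. The absent entries are skipped without error, which is what allows one table to describe many possible machine configurations.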

After cpu_startup(), the system begins to schedule processes. We allow for this by enabling the 
rescheduling clock. This clock periodically interrupts the kernel and adjusts the priority of other processes 
that might compete for use of the processor. 


Mounting the Root 

We next initialize the virtual filesystem layer. We make our first reference to it by mounting the root 
filesystem and marking it as the top-level point from which to resolve filename references. The root 
filesystem, like other filesystems, can be of many different types. However, as this request is honored by 
code that calls successively lower-layer functions, we ultimately get to the bottom layers in the form of a 
device driver that extracts from the disk or network the external information of the filesystem on which all 
files are stored. If the root filesystem cannot be located, 386BSD abruptly terminates. 


Final Machine Initialization 


Our final machine initialization step is to split process 0 into three processes (see Figure 1). This is done 
by creating separate copies of initial process 0 with the fork1() kernel service. fork1() implements the 
"replicate process" functionality used by the UNIX fork() system call. After being copied twice (creating 
process 1 and 2, both blocked), process 0 will call the scheduling function sched(), which endlessly 
selects processes to shuttle in and out of secondary storage. In essence, it also manages to enforce a 
"fairness" policy on the runnable processes present in RAM. If sched() finds nothing to do 
(as it will at this stage of the system’s life), it will block, waiting to wake up when things need to be 
shuffled again. 


When process 0 blocks, process 1, which has been patiently waiting since fork1() was invoked, can be 
run. Process 1 is then furnished a user address space with a tiny user program inserted into it. Control 
is then transferred to the user process, whose first instruction is to execute a file on the root filesystem (/sbin/init). Thus, 
our tiny bootstrap program, wired into the kernel, pulls in a much larger UNIX program located in the 
root. Even better, the init program is created with the same tools, operates in the same protected fashion, 
and functions with the same system calls as any UNIX program. This means we can use the richness of the 
program environment to build a more elaborate degree of functionality as the system boots itself up. At 
some point, however, process 1 will block (perhaps waiting for the disk to find a block of data for init). At 
this point, another process can be run. 


Process 2, yet another copy of process 0, is given the chance to run at this point. It will immediately call 
the pageout() function, the sole purpose of which is to scout out pages of underutilized memory (that is, 
held by some process, but not being used). This compulsive little function varies its activity depending on 
the amount of unused memory available. If little memory is available, it rapidly bails water, forcing pages 
of processes out to secondary storage (swap space) to prevent the system from becoming constipated due 
to lack of memory. If plenty of memory exists (as it does at the start of system boot up), it blocks waiting for 
a more desperate time. 

Processes 0 and 2 are system processes that only run in the kernel; as endlessly looping functions, they 
provide a special service when awakened. Process 1, on the other hand, is an ordinary user process 
running code loaded from the root filesystem. Among other niceties that our init program provides, it 
offers a command interpreter through the use of the UNIX fork() and exec() primitives. In Figure 1, for 
example, fork() and execve() system calls are successively used to replicate a new process (process 3) 
and execute the default command interpreter (or shell) /bin/sh. In turn, the shell will follow the same 
mechanism to create more processes and fill them with programs the user requests. The thick grey line in 
Figure 1 delineates the state of the world by the end of main() in the kernel. The asterisk represents the 
point where the first user instruction is executed, while below the line all remaining initialization, done by 
user processes, occurs. 
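The fork() and execve() sequence used by user processes to start a shell can be sketched as follows. This is an ordinary POSIX user program, not code from init; the helper name spawn_shell() is hypothetical, and waiting for the child to finish is our own addition so that the example has a result to report.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Replicate the calling process, then overlay the copy with /bin/sh
 * running the given command line.  Returns the shell's exit status,
 * or -1 on failure.  (spawn_shell is a hypothetical helper name.)
 */
int spawn_shell(const char *cmdline)
{
    char *argv[] = { "sh", "-c", (char *)cmdline, NULL };
    char *envp[] = { NULL };
    int status;
    pid_t pid;

    pid = fork();                       /* step 1: replicate the process */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve("/bin/sh", argv, envp);  /* step 2: fill it with a program */
        _exit(127);                     /* reached only if execve() fails */
    }
    if (waitpid(pid, &status, 0) < 0)   /* parent: wait for the child */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, spawn_shell("exit 7") returns 7 on a system with a standard /bin/sh. The shell itself uses the same two steps for every command the user types.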

The two system processes provide a synchronous mechanism (remember, the high layers are synchronous) 
to rectify resource imbalances. By possessing the complete resources of a process, each can use the 
kernel’s versatility, including blocking operations that the asynchronous lower layer routines are forbidden 
to use (such as requesting disk I/O). 

Summary 

In this article, we have just touched on the layout of our generic 386BSD system (4.3 < x < 4.4), and 
introduced many of the mechanisms, data structures, and relationships between them. Our point is not to 
provide exhaustive descriptions of the operation of BSD in general, but to provide enough background to 
understand the operation of 386-related code, as well as design choices. 

To accomplish this task, we’ve purposely not described much of the detail of the various BSD subsystems; 
it is sufficient at this point if you have obtained some notion of what they are and why we need to turn 
them on in the order that we do. In conducting a port, one actually makes it through this body of code 
pretty quickly. It is the ticklish operations of fork, exec, and process context switching that get the first 
shakedown journey and surprises. Even after the kernel design has been refined and much of this code 
revised, this area continues to present challenges. 

In the next article, we will leave the hand-waving descriptions of process switching behind and dig into 
some actual code. In particular, we shall examine sleep(), wakeup(), and swtch(), and how the three of 
these bring off the illusion of multiple simultaneous process execution on a sole processor. We will also 
delve into why the UNIX paradigm shifts comparatively easily when it comes to multitasking, and why 
it’s been such a long uphill climb to move others (notably MS-DOS and Finder) into preemptible 
multitasking. Finally, we will discuss some of the requirements for the extensions needed to support 
multiprocessor and multithreaded operation in the monolithic 386BSD kernel. 

386BSD Availability 

The Computer Systems Research Group at the University of California Berkeley has announced that the 
BSD Networking Software Release 2—which includes 386BSD—is now available for licensing. The 
distribution is a source distribution only, and does not contain program binaries for any architecture. Thus 
it is not possible to compile or run this software without a preexisting system that is installed and running. 
In addition, the distribution does not include sources for a complete system. It includes source code and 
manual pages for the C library and approximately three-fourths of the utilities distributed as part of 
4.3BSD-Reno. The software distribution is provided on 1/2-inch 9-track tape and 8mm cassette only. For 
specific information, contact the Distribution Coordinator, CSRG, Computer Science Division, EECS, 
University of California, Berkeley, CA 94720 or bsd-dist@CS.Berkeley.EDU or 
uunet!bsd-dist@CS.Berkeley.EDU. 

[LISTING ONE] 

/* Copyright (c) 1986, 1989, 1991 The Regents of the University of California.
 * All rights reserved.
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. All advertising materials mentioning features or use of this software
 *    must display the following acknowledgement:
 *      This product includes software developed by the University of
 *      California, Berkeley and its contributors.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS "AS IS" AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 * @(#)proc.h 7.28 (Berkeley) 5/30/91
 */

#ifndef _PROC_H_
#define _PROC_H_

#include <machine/proc.h>       /* machine-dependent proc substruct */

/* One structure allocated per session. */
struct session {
    int s_count;                /* ref cnt; pgrps in session */
    struct proc *s_leader;      /* session leader */
    struct vnode *s_ttyvp;      /* vnode of controlling terminal */
    struct tty *s_ttyp;         /* controlling terminal */
    char s_login[MAXLOGNAME];   /* setlogin() name */
};

/* One structure allocated per process group. */
struct pgrp {
    struct pgrp *pg_hforw;      /* forward link in hash bucket */
    struct proc *pg_mem;        /* pointer to pgrp members */
    struct session *pg_session; /* pointer to session */
    pid_t pg_id;                /* pgrp id */
    int pg_jobc;                /* # procs qualifying pgrp for job control */
};

/* Description of a process. This structure contains information needed to
 * manage a thread of control, known in UNIX as a process; it has references
 * to substructures containing descriptions of things that process uses, but
 * may share with related processes. Process structure and substructures are
 * always addressible except for those marked "(PROC ONLY)" below, which might
 * be addressible only on a processor on which the process is running. */
struct proc {
    struct proc *p_link;        /* doubly-linked run/sleep queue */
    struct proc *p_rlink;
    struct proc *p_nxt;         /* linked list of active procs */
    struct proc **p_prev;       /* and zombies */

    /* substructures: */
    struct pcred *p_cred;       /* process owner's identity */
    struct filedesc *p_fd;      /* ptr to open files structure */
    struct pstats *p_stats;     /* accounting/statistics (PROC ONLY) */
    struct plimit *p_limit;     /* process limits */
    struct vmspace *p_vmspace;  /* address space */
    struct sigacts *p_sigacts;  /* signal actions, state (PROC ONLY) */
#define p_ucred p_cred->pc_ucred
#define p_rlimit p_limit->pl_rlimit

    int p_flag;
    char p_stat;
    pid_t p_pid;                /* unique process id */
    struct proc *p_hash;        /* hashed based on p_pid for kill+exit+... */
    struct proc *p_pgrpnxt;     /* pointer to next process in process group */
    struct proc *p_pptr;        /* pointer to process structure of parent */
    struct proc *p_osptr;       /* pointer to older sibling processes */

    /* The following fields are all zeroed upon creation in fork */
#define p_startzero p_ysptr
    struct proc *p_ysptr;       /* pointer to younger siblings */
    struct proc *p_cptr;        /* pointer to youngest living child */

    /* scheduling */
    u_int p_cpu;                /* cpu usage for scheduling */
    int p_cpticks;              /* ticks of cpu time */
    fixpt_t p_pctcpu;           /* %cpu for this process during p_time */
    caddr_t p_wchan;            /* event process is awaiting */
    u_int p_time;               /* resident/nonresident time for swapping */
    u_int p_slptime;            /* time since last block */
    struct itimerval p_realtimer;   /* alarm timer */
    struct timeval p_utime;     /* user time */
    struct timeval p_stime;     /* system time */
    int p_traceflag;            /* kernel trace points */
    struct vnode *p_tracep;     /* trace to vnode */
    int p_sig;                  /* signals pending to this process */

    /* end area that is zeroed on creation */
#define p_endzero p_startcopy

    /* The following fields are all copied upon creation in fork */
    sigset_t p_sigmask;         /* current signal mask */
#define p_startcopy p_sigmask
    sigset_t p_sigignore;       /* signals being ignored */
    sigset_t p_sigcatch;        /* signals being caught by user */
    u_char p_pri;               /* priority, negative is high */
    u_char p_usrpri;            /* user-priority based on p_cpu and p_nice */
    char p_nice;                /* nice for cpu usage */
    struct pgrp *p_pgrp;        /* pointer to process group */
    char p_comm[MAXCOMLEN+1];

    /* end area that is copied on creation */
#define p_endcopy p_wmesg
    char *p_wmesg;              /* reason for sleep */
    struct user *p_addr;        /* kernel virtual addr of u-area (PROC ONLY) */
    swblk_t p_swaddr;           /* disk address of u area when swapped */
    int *p_regs;                /* saved registers during syscall/trap */
    struct mdproc p_md;         /* any machine-dependent fields */
    u_short p_xstat;            /* Exit status for wait; also stop signal */
    u_short p_acflag;           /* accounting flags */
};

#define p_session p_pgrp->pg_session
#define p_pgid p_pgrp->pg_id

/* Shareable process credentials (always resident). Includes a reference to
 * current user credentials as well as real and saved ids that may be used to
 * change ids. */
struct pcred {
    struct ucred *pc_ucred;     /* current credentials */
    uid_t p_ruid;               /* real user id */
    uid_t p_svuid;              /* saved effective user id */
    gid_t p_rgid;               /* real group id */
    gid_t p_svgid;              /* saved effective group id */
    int p_refcnt;               /* number of references */
};

/* stat codes */
#define SSLEEP      1           /* awaiting an event */
#define SWAIT       2           /* (abandoned state) */
#define SRUN        3           /* running */
#define SIDL        4           /* intermediate state in process creation */
#define SZOMB       5           /* intermediate state in process termination */
#define SSTOP       6           /* process being traced */

/* flag codes */
#define SLOAD       0x0000001   /* in core */
#define SSYS        0x0000002   /* swapper or pager process */
#define SSINTR      0x0000004   /* sleep is interruptible */
#define SCTTY       0x0000008   /* has a controlling terminal */
#define SPPWAIT     0x0000010   /* parent is waiting for child to exec/exit */
#define SEXEC       0x0000020   /* process called exec */
#define STIMO       0x0000040   /* timing out during sleep */
#define SSEL        0x0000080   /* selecting; wakeup/waiting danger */
#define SWEXIT      0x0000100   /* working on exiting */
#define SNOCLDSTOP  0x0000200   /* no SIGCHLD when children stop */
#define STRC        0x0004000   /* process is being traced */
#define SWTED       0x0008000   /* another tracing flag */
#define SADVLCK     0x0040000   /* process may hold a POSIX advisory lock */

#ifdef KERNEL
/* We use process IDs <= PID_MAX; PID_MAX + 1 must also fit in a pid_t
 * (used to represent "no process group"). */
#define PID_MAX 30000
#define NO_PID 30001
#define PIDHASH(pid) ((pid) & pidhashmask)

#define SESS_LEADER(p) ((p)->p_session->s_leader == (p))
#define SESSHOLD(s) ((s)->s_count++)
#define SESSRELE(s) { \
    if (--(s)->s_count == 0) \
        FREE(s, M_SESSION); \
}

extern int pidhashmask;             /* in param.c */
extern struct proc *pidhash[];      /* in param.c */
struct proc *pfind();               /* find process by id */
extern struct pgrp *pgrphash[];     /* in param.c */
struct pgrp *pgfind();              /* find process group by id */

struct proc *zombproc, *allproc;    /* lists of procs in various states */
extern struct proc proc0;           /* process slot for swapper */
struct proc *initproc, *pageproc;   /* process slots for init, pager */
extern struct proc *curproc;        /* current running proc */
extern int nprocs, maxproc;         /* current and max number of procs */

#define NQS 32                      /* 32 run queues */
struct prochd {
    struct proc *ph_link;           /* linked list of running processes */
    struct proc *ph_rlink;
} qs[NQS];
int whichqs;                        /* bit mask summarizing non-empty qs's */
#endif /* KERNEL */
#endif /* !_PROC_H_ */
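The p_startzero/p_endzero and p_startcopy/p_endcopy markers in the listing name the first field of a range and the first field past it, so that a new process can have one run of fields cleared and another run inherited from its parent in two block operations. The sketch below shows the same technique on a deliberately tiny structure. struct miniproc, its mp_ fields, and miniproc_inherit() are hypothetical names made up for the example; the real fork code is not reproduced here.

```c
#include <stddef.h>
#include <string.h>

/* A cut-down stand-in for struct proc; all names are hypothetical. */
struct miniproc {
    int pid;                    /* set explicitly for every new process */

#define mp_startzero mp_cpu     /* first field of the zeroed range */
    unsigned mp_cpu;            /* zeroed in the child */
    int mp_sig;                 /* zeroed in the child */
#define mp_endzero mp_startcopy /* first field past the zeroed range */

#define mp_startcopy mp_sigmask /* first field of the copied range */
    unsigned mp_sigmask;        /* inherited from the parent */
    char mp_nice;               /* inherited from the parent */
    char mp_comm[8];            /* inherited from the parent */
#define mp_endcopy mp_wmesg     /* first field past the copied range */

    const char *mp_wmesg;       /* outside both ranges: left untouched */
};

/* Clear one run of fields and copy another, using the marker macros. */
void miniproc_inherit(struct miniproc *child, const struct miniproc *parent)
{
    size_t zoff = offsetof(struct miniproc, mp_startzero);
    size_t zlen = offsetof(struct miniproc, mp_endzero) - zoff;
    size_t coff = offsetof(struct miniproc, mp_startcopy);
    size_t clen = offsetof(struct miniproc, mp_endcopy) - coff;

    memset((char *)child + zoff, 0, zlen);
    memcpy((char *)child + coff, (const char *)parent + coff, clen);
}
```

The appeal of the convention is that adding a field to either range requires no change to the code that creates processes, as long as the field is declared between the right pair of markers.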

Figure 1: The initial processes generated by the kernel: (a) main kernel 
processes; (b) user mode processes. 

Figure 2: Process auxiliary data structures. A "proc" slot (p->) refers to the following: 
credentials (p->p_cred), file descriptors (p->p_fd), administrative limits (p->p_limit), 
runtime statistics (p->p_stats), kernel stack (p->p_addr), signal actions (p->p_sigacts), 
address space (p->p_vmspace), and process groups and sessions (p->p_pgrp). 


Porting Unix To The 386: The Basic Kernel: Multiprogramming 
and Multitasking, Part One 

An overview of the multiprogramming paradigm in 386BSD. Conventions, definitions, and organization 
of multiprogramming. 







Multiprogramming and Multitasking, Part One 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, the first virtual memory micro-processor-based UNIX system. Prior to 
establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric 
Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or 
comments to lynne@berkeley.edu. (c) 1991 TeleMuse. 

Last month, we began our preliminary discussion of the main() procedure in the 386BSD kernel. This 
procedure is critical to UNIX, and we will be referring to it again and again as we introduce more 
functionality in our 386BSD kernel and incrementally turn on all of the kernel’s internal services. We 
examined the validation and debugging of kernel functions up to executing the main() procedure, 
processor initialization, turning on paging, and sizing memory, among other things. We then went on to 
create our initial page fault and context generation for the scheduler, paging, and first user processes. In 
other words, by building much of the main() procedure, we also assembled the framework for the next 
outer layer of our UNIX operating system — which, in turn, will initialize higher-level portions of the 
kernel. 

This month and next, we turn to a seemingly unrelated area: multiprogramming and multitasking. Their 
relevance lies in that beginning with this stage of our port, the code grows more complex as more items 
interact with each other; as such, we are often working simultaneously on many key portions. This is one 
reason why ports are often begun with great hopes and aspirations, but abandoned as the complexity 
increases. 

Multiprogramming and multitasking, two important elements which help make UNIX "UNIX," are also 
areas of the basic design that the creators did particularly well; hence, they are instructive to anyone 
interested in operating systems design. Many other popular operating systems have only recently allowed 
for multiprogramming — at great cost and incompletely — even though it was "planned for" in their early 
designs. 

This month, we will reexamine our understanding of the conventions—the style of programming portions 
of the operating system—inherent in the design of UNIX. These conventions are the conceptual framework 
upon which a multiprogramming operating system is built. Once these conventions are understood, the 
ease and simplicity of the UNIX multiprogramming environment contrasts markedly with the more 
convoluted attempts at multiprogramming in later operating systems. 

Next month we'll examine some actual code; in particular, sleep(), wakeup(), and swtch(), and how they 
create the illusion of multiple, simultaneous process execution on a single processor. We'll discuss some of 
the requirements for the extensions needed to support multiprocessor and multithreaded operation in the 
monolithic 386BSD kernel and reexamine the multiprogramming attempts of some other operating 
systems in light of what we have learned. 





What is Multiprogramming? 


Occasionally, an area is so misunderstood or shrouded in mysticism that the simple and elegant 
explanation is ignored in favor of an obtuse or complicated one. Multiprogramming is so elemental to the 
design of any operating system that ignorance of its development and structure precludes any real 
understanding of UNIX. It is a concept which appears obvious to the typical user (and for good reason): 
Through multiprogramming we can leverage or work on several tasks at once. Obviously, concurrently 
running several editors, formatters, and output devices would be as important to a writer as simultaneously 
running a compiler, debugger, and editor is to a programmer. 

The term "multiprogramming" implies multiple applications active at any time. In other words, it is the 
effect we can see of having access to and working with these programs. A UNIX system generally allows 
a fair number of simultaneous applications programs to be present at a given time, a convenience to which 
the average UNIX user quickly becomes accustomed. (While we wrote this article, there were four editor 
processes, two command processors or shells — one of which was on a different host on the network — and 
a contact management program present. This is a typical level of activity for one or two people doing a 
little writing.) While the computer can cause them all to appear active in the wink of an eye, we can't use 
them nearly as fast, so we just hop between programs, calling more as needed. 




Attempts at Multiprogramming: MS-DOS and Finder 

UNIX users become so spoiled with the ability to multiprogram that they don't fare well when migrated 
back to a non-multiprogramming environment such as MS-DOS or Finder. Not surprisingly, later 
versions of these have attempted to compensate for this lack with limited extensions such as line printer 
spooling (via a TSR in MS-DOS for example). These extensions are by no means perfect, however. For 
example, certain applications programs invoked from within nested command processors invariably fail, 
sometimes with catastrophic results. 

Windows 3.0 for the 386 attempts to increase multiprogramming capabilities beyond earlier MS-DOS 
extensions by running a protected-mode, preemptive-scheduling operating system with multiple MS-DOS 
partitions. It gets around the problem by simulating a separate complete MS-DOS environment for each 
task using the virtual 8086 mode on the processor itself. In other words, it effectively places a hard shell 
around each MS-DOS "task," as if it were running on a separate processor. 

Internally, the software looks somewhat like UNIX. Through the 8086 mode feature of the 386 processor, 
Windows/386 can leverage some UNIX approaches to multiprogramming. The MS-DOS sessions end up 
running in "almost" real mode while the rest of the operating system, like 386BSD, runs in protected mode 
in a completely different part of the system. Even some of the bugs encountered (and fixed) in porting 
386BSD to the PC are similar to those in extant protected-mode systems like Windows/386. 

On the Macintosh, applications can be launched and switched, but again the focus is more intraapplication 
than interapplication. System 7.0 with Multifinder is supposed to deal with this area more 
comprehensively. (We shall see.) 






The question as to why multiprogramming was more difficult to achieve in MS-DOS and Finder on the 
Macintosh, for example, can be turned around and phrased in a more direct manner: "Where in UNIX is 
the concept of multiprogramming implemented (if there is such a place), and what elements of 
multiprogramming are missing from other operating systems that made extensions for multiprogramming 
more difficult later on?" 

In UNIX, the concept of multiprogramming was extant from the beginning, so let’s examine what the 
designers of UNIX did right. 

Conventions and Definitions 

To the user, multiprogramming is the ability to use and access many programs at once. Organick, 
however, defines it more formally: "Systems for 'passing the processor around' among several 
processes, so as to prevent the idling of a CPU during I/O waits or other delays, are known as 
multiprogramming systems." While utilitarian and concise, this definition yields little insight to the 
layman. 

To make the concept of multiprogramming more obvious, we must review some fundamental terms: 
processes and tasks, context switching, preemption and multitasking, and time slicing. We’ll also contrast 
our understanding of these terms with those of other designers who helped develop the functional 
definitions. 

Processes and Tasks 

Programming on UNIX is predicated upon the existence of processes. Dennis and Van Horn’s is a good 
basic definition: "A process is a locus of control within an instruction sequence. That is, a process is that 
abstract entity which moves through the instructions of a procedure as the procedure is executed by a 
processor." This definition describes a dynamic entity practically "swimming" through the code but does 
not tie us down by saying exactly what a process consists of or how it works through the code. 

Organick gives a more functional definition, inherited from the precursor to UNIX, MULTICS: "Process 
(Lay Definition): A set of related procedures and data undergoing execution and manipulation, 
respectively, by one of possibly several processors of a computer." Again, our definition is not "absolute." 
We run into this problem with a few other terms, in particular, lightweight processes and threads. (See the 
sidebar entitled "Brief Notes: Lightweight Processes and Threads.") Modern UNIX systems utilize 
lightweight processes, as opposed to the heavyweight, let’s-do-everything MULTICS processes. 

Processes, by their nature, are isolated entities. Tasks, on the other hand, share resources such as variables 
and memory. A task can be a subset of a process (such as a thread) or live outside of a process (in a 
portion of memory on the system, for example). A task is really just some register state and a bag of 
memory that can be accessed by other tasks. Keep in mind that this concept is more primitive than that of 
a process. 

Tasks are "well-behaved" when they process the event that activated them without locking out other 
cooperating tasks (that is, not doing time-consuming operations) and when they relinquish execution to 
give other lower-priority tasks a chance to run. Tasks must be careful to arbitrate for shared resources 
before they use them. 





Real-time tasks perform many operations usually done within the unseen internals of a time-sharing 
system. Thus, the transparency so touted in a timesharing system hinders the external visibility needed for 
real-time operation. (See the sidebar entitled "Brief Notes: Is UNIX Real-Time Enough?") 

Context Switching 

The actual operation of going from one process to another is called a context switch. When this occurs, the 
state information of the computer’s processor (registers, mode, memory mapping information, coprocessor 
registers, and so on) or "context," is saved away in a location where it can be later restored. Then, a new 
process which must be run is found. Finally, the state information of the new process must be loaded and 
run. 
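The save, select, and restore steps can be imitated at user level. The sketch below is not the kernel's swtch(); it assumes a system whose C library still provides the getcontext(), makecontext(), and swapcontext() routines (glibc on Linux does), and the names task_body(), run_switch_demo(), and trace are invented for the example. Each swapcontext() call saves the caller's register state in one structure and loads the state held in another.

```c
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;   /* two saved "contexts" */
static char task_stack[64 * 1024];      /* private stack for the task */
static int trace[4], ntrace;            /* records the order of execution */

static void task_body(void)
{
    trace[ntrace++] = 2;                /* running on the task's stack */
    swapcontext(&task_ctx, &main_ctx);  /* save our state, resume main */
    trace[ntrace++] = 4;                /* resumed exactly where we left off */
}

/* Switch into the task twice; returns the number of trace entries made. */
int run_switch_demo(void)
{
    ntrace = 0;
    if (getcontext(&task_ctx) < 0)
        return -1;
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;       /* resumed when task_body returns */
    makecontext(&task_ctx, task_body, 0);

    trace[ntrace++] = 1;
    swapcontext(&main_ctx, &task_ctx);  /* first switch: task starts */
    trace[ntrace++] = 3;
    swapcontext(&main_ctx, &task_ctx);  /* second switch: task continues */
    return ntrace;
}
```

After run_switch_demo() returns, trace holds 1, 2, 3, 4, showing that each side picks up precisely where it was suspended. A kernel context switch must additionally handle memory mapping and coprocessor state, which this user-level sketch does not touch.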

The word "context" is used to refer to different, but related things. When a processor gets an interrupt, or a 
procedure is called, the state information of "what the processor was doing before" is always recorded 
(somewhere — it depends on the computer and system). This might be termed an "interrupt context switch" 
or a "procedure context switch." These are different from a process (or task) context switch. 

A process context switch can be a costly operation, especially if a processor has a large number of 
registers. The additional overhead of this operation is an unwelcome burden that a nonmultitasking 
operating system never carries; it is the price for doing n things at once. This overhead grows in proportion to how 
much multitasking is done at a time. Remember, many active processes result in many context switches. 

Preemption and Multitasking 

Simple multitasking systems can be written that run one task at a time and switch to another task when 
idling. These systems are nonpreemptible, because a process is not allowed to preempt or run ahead of the 
currently active process. This mechanism does not allow for much flexibility. Early examples of 
nonpreemptible multitasking systems included MP/M and UCSD Pascal. 
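A nonpreemptible multitasking system of the kind just described can be reduced to a single loop. In this sketch, which is ours and not taken from any of the systems named, each task is a function that does a little work and returns; struct task, run_cooperative(), and count_down() are hypothetical names. The scheduler regains control only when a task returns voluntarily.

```c
#include <stddef.h>

/* A task does a small amount of work per call; all names are hypothetical. */
struct task {
    int (*step)(struct task *); /* returns 0 when the task has finished */
    int remaining;              /* units of work still to do */
    int done;                   /* set by the scheduler on completion */
};

/*
 * Run every unfinished task in turn until all are finished.
 * Returns the number of times control was handed to a task.
 * A task that never returns would stall everything: there is no
 * mechanism here to take the processor away from it.
 */
int run_cooperative(struct task *tasks, size_t n)
{
    size_t i, active = 0;
    int handoffs = 0;

    for (i = 0; i < n; i++)
        if (!tasks[i].done)
            active++;
    while (active > 0) {
        for (i = 0; i < n; i++) {
            if (tasks[i].done)
                continue;
            handoffs++;
            if (tasks[i].step(&tasks[i]) == 0) {
                tasks[i].done = 1;
                active--;
            }
        }
    }
    return handoffs;
}

/* Example task: consume one unit of work per call. */
int count_down(struct task *t)
{
    return --t->remaining > 0;
}
```

With one task needing three units of work and another needing one, the loop hands control over four times in total. The inflexibility mentioned above is visible in the comment on run_cooperative(): fairness depends entirely on every task behaving well.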

MS-DOS, in contrast, is a single-tasking system. To illustrate the difference, let’s assume we are running a 
program, such as a number cruncher, on both our nonpreemptible multitasking system and our 
single-tasking MS-DOS system. If the program allows for no interruptions, it will run to completion on 
both systems. Let’s assume, however, that the programmer who wrote this program installed a request for 
input midway through the program. At the point the request for input appears and the computation is 
stopped, we can "put aside" the nonpreemptible multitasking system's program, run another task, and then 
return to input the data and continue the program. On our single-tasking MS-DOS system, however, we 
cannot put aside the program midway through; we must either input the requested data and complete the 
computation or abort it. 

A preemptible multitasking system is far more interesting for our purposes. Like our nonpreemptible 
multitasking system, it runs one task at a time and switches to another task when idling; 
in addition, one task can preempt another automatically, without manual intervention. This concept is quite powerful 
in practice, but adds its own complications, as we shall see. 

Obviously, preemptive systems must have a way of grabbing control and preempting the current process. 
Therefore, preemption mechanisms are either "immediate" or "casual" in action, depending on the amount 
of time available to activate a process to run. In a real-time system, an immediate guaranteed response 
time is crucial; with a time-sharing system, even a ponderous one-fourth of a second (approximately 
500,000 386 instructions) response time is unnoticeable. 

Preemption adds two major costs: First, it implies more context switches (hence, increased overhead); and 
second, nonpreemptible sections must be carefully coded to remain nonpreemptible. Additional code is 
usually required to surround nonpreemptible sections and when contending for shared resources that may 
become active. (These nonpreemptible sections of code are called "critical" because they must execute 
without preemption; otherwise, the integrity of the system is compromised.) These costs can increase, depending on 
the degree of preemption allowed. 




Unlike a real-time system, UNIX cannot preempt in favor of a higher priority process while the kernel 
is already processing something that cannot be "blocked." "Wakeup" calls can schedule a higher priority 
process that wants to be run, but up to an entire rescheduling clock tick can pass before this shuffling of 
the schedule is noticed. 

Time Slicing 

An important consideration with time-sharing systems in general is to ensure that each process receives a 
period of time to run on the processor when it is required, and that, when many processes are ready, the 
system’s scheduler "round-robins" among them and tenders the appropriate time to each. With our preemptible 
multitasking system, we can afford to give processes a period of activity, called a timeslice, and switch 
back and forth among them. (A UNIX timeslice has a lifetime of one-tenth of a second.) A process can run up to the 
lifetime of a timeslice unless some other process intrudes (preempts). In contrast, a real-time system will 
only activate a well-behaved task in response to a given event. 
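
A toy model can make the round-robin policy concrete. The sketch below is not taken from any UNIX scheduler: the names rr_proc and round_robin, and the idea of measuring work in clock ticks, are assumptions made purely for illustration. With a 100-Hz clock, a one-tenth-second timeslice corresponds to 10 ticks.

```c
#include <stddef.h>

/* Hypothetical toy process: it simply needs some number of clock ticks. */
struct rr_proc {
    int ticks_needed;   /* work remaining, in clock ticks */
    int finished_at;    /* tick at which it completed, or -1 if not yet */
};

/*
 * Run n processes round-robin, giving each at most `slice` ticks per turn.
 * Returns the total number of ticks consumed, or -1 for a bad slice.
 */
int round_robin(struct rr_proc *p, size_t n, int slice)
{
    int now = 0;
    size_t remaining = 0;

    if (slice <= 0)
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (p[i].ticks_needed > 0) {
            p[i].finished_at = -1;
            remaining++;
        } else {
            p[i].finished_at = 0;   /* nothing to do */
        }
    }

    while (remaining > 0) {
        for (size_t i = 0; i < n; i++) {
            if (p[i].ticks_needed <= 0)
                continue;
            /* Run for one timeslice, or less if the process finishes early. */
            int run = p[i].ticks_needed < slice ? p[i].ticks_needed : slice;
            now += run;
            p[i].ticks_needed -= run;
            if (p[i].ticks_needed == 0) {
                p[i].finished_at = now;
                remaining--;
            }
        }
    }
    return now;
}
```

Given three processes needing 25, 5, and 10 ticks and a 10-tick slice, the short jobs finish at ticks 15 and 25 while the long one finishes at tick 40; no process waits for the long one to run to completion.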

The system’s scheduler determines the rules detailing which process is allowed to run. Usually, the 
policies the scheduler uses to manage resources (processor time, RAM, I/O bandwidth, preferential use) 
have a root goal in mind. With UNIX, the concept of "fairness" is invoked, in that processes from multiple 
users are allowed to compete for resources on an even basis. While this is appropriate for a time-sharing 
system, a real-time system would require us to have an intentionally "unfair" scheduling policy in place — 
one that would award resources to a task solely on the basis of its runtime priority and event occurrence. 

UNIX Organization for Multiprogramming 

From the very beginning, UNIX was conceived of as a "preemptible multitasking system," on top of which 
lightweight processes are built. Preemptible multitasking occurs at the internal kernel level of the system 
and its mechanics are transparent to the user. Multiprogramming is built upon these mechanisms and is 
what is observed at the user level. It is more concerned with the effects of our preemptible multitasking 
system on the user (in other words, how we interact with the system). By the way, if we have more than 
one processor, we are also doing multiprocessing. Got all that? Good, because these terms are thrown 
about all the time with little distinction between them, and they are distinct. 

UNIX utilizes a limited-preemption mechanism that provides each process a timeslice which it can 
consume. Because its goals are oriented towards timesharing, such timeslices are made perceptibly short 
to maintain the illusion of timesharing. The mechanism for providing this is simple and elegant, avoiding 
"ultimate" mechanisms that impose complexity throughout the kernel. The trade-off for this simple 
approach is a minimal, submillisecond response time delay, event-to-process: not a great loss for a 
time-sharing system. 

Having obtained an overview of multiprogramming and the related terminology, we can now delve inward 
to the actual mechanisms by which processes are created, switched around, and terminated. 

A UNIX Process’s Double Life 

A UNIX process possesses both a user-mode program and a supervisor-mode "alter ego." The user-mode 
applications program runs code and obtains service from the operating system via system calls. This in 
turn causes the computer to enter supervisor mode and run the operating system’s code, which processes 
the exception (system call) for this process. During the processing of a system call, or other exception, 
multitasking code comes into play. Because the computer system’s periodic clock interrupt forces entry 
into the operating system (usually 100 times a second), even a process that does not call the system 
by itself (one doing a heavy calculation, say) necessarily enters the system in this manner. 

Blocking and Unblocking Processes 

A process waiting for an event can give up the processor by calling the tsleep() routine to "block" itself out 
from the processor until the expected event occurs. It then frees up the processor to run another process. 
tsleep() records the priority and other details of its slumber. When the process is awakened, the calling code 
checks to see if the event for which it was waiting has occurred. If not, another tsleep() is issued. 

When the event occurs, processes waiting for it are "unblocked" by a wakeup() call. The wakeup() call 
reschedules the previously blocked process, allowing it to run when it next has priority. wakeup() and 
tsleep() are not necessarily discretely paired; one wakeup() call can awaken many tsleep()ing processes 
waiting for the same event. This explains why a process is not guaranteed that the condition waited for is 
true; if one buffer becomes free, a dozen processes may wake up for it. Only one is satisfied, so the others 
return to their slumbers. 
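
The "wake everyone, let each recheck" behavior described above can be modeled in a few lines. This is a single-threaded simulation, not kernel code: sl_proc, sl_sleep, sl_wakeup, and sl_try_get_buffer are hypothetical names, and the address of the free-buffer counter stands in for the event being waited on.

```c
#include <stddef.h>

enum sl_state { SL_SLEEPING, SL_RUNNABLE };

/* Hypothetical, stripped-down process record. */
struct sl_proc {
    enum sl_state state;
    const void *wchan;   /* event being waited for, NULL if none */
    int has_buffer;
};

/* Block: record the event and mark the process as sleeping. */
void sl_sleep(struct sl_proc *p, const void *chan)
{
    p->wchan = chan;
    p->state = SL_SLEEPING;
}

/* Unblock every process sleeping on chan; returns how many were awakened. */
int sl_wakeup(struct sl_proc *procs, size_t n, const void *chan)
{
    int woken = 0;

    for (size_t i = 0; i < n; i++) {
        if (procs[i].state == SL_SLEEPING && procs[i].wchan == chan) {
            procs[i].state = SL_RUNNABLE;
            procs[i].wchan = NULL;
            woken++;
        }
    }
    return woken;
}

/*
 * What an awakened process does when it next runs: recheck the condition,
 * and sleep again if another process got there first.
 * Returns 1 if a buffer was obtained, 0 if the process went back to sleep.
 */
int sl_try_get_buffer(struct sl_proc *p, int *free_buffers)
{
    if (*free_buffers > 0) {
        (*free_buffers)--;
        p->has_buffer = 1;
        return 1;
    }
    sl_sleep(p, free_buffers);   /* condition is not true after all */
    return 0;
}
```

If three processes sleep on the same counter and one buffer is freed, a single sl_wakeup() marks all three runnable, but only the first to run obtains the buffer; the other two find the count at zero and sleep again.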

Process Context-Switching Mechanism 

In order to provide multiprogramming capability, we must be able to exchange the currently running 
program with the next program to be run whenever the current process blocks for an event or the currently 
allocated timeslice of the process is consumed. This switch from one process to another, or process 
context switch, is a very critical piece of code, and is the pivotal mechanism for multiple execution. 

Interestingly enough, a process context switch is similar to a subroutine call mechanism called a 
"coroutine." Coroutines are not nested within a hierarchy like subroutines, but instead reside at the same 
level. When a coroutine pauses, it returns execution to its caller, knowing that it will be reentered at some 
future date when its caller suspends. 

Next Month 

Our next step is to discuss in depth the 386BSD swtch() routine, and how it imparts multiprocessing 
capabilities to 386BSD. We’ll embark upon those subjects next month. 





References 


Organick, Elliot I. The Multics System. Cambridge, Mass.: MIT Press, 1972. 

Dennis, J.B. and Earl C. Van Horn. "Programming Semantics for Multiprogrammed Computations." 
CACM, vol. 9, no. 3 (1966). 

Brief Notes: Lightweight Processes and Threads 

On MULTICS, processes were so "heavy-weight" that a fair amount of work was done to reuse them, 
rather than create and destroy them constantly. In a similar vein, VMS processes are "precreated" before 
use to minimize activation time, "cleaned and pressed" after use, in preparation for reuse, and then 
terminated. UNIX, on the other hand, relies on lightweight processes because it uses one (or many) for 
each command executed through the shell. In other words, processes are so convenient to work with (each 
has a completely independent and isolated program within) that we want to use them commonly. This is 
why we make them lightweight in UNIX — so we can cheaply create and destroy them as needed. 

During the mid-1980s, research versions of UNIX found that a limiting factor in the speed and efficiency 
of UNIX might be, in part, that lightweight processes were not lightweight enough. For example, to 
perform a UNIX fork() operation to clone a process, the early versions of UNIX were required to copy an 
entire process, frequently unnecessarily (such as copying a 200-Kbyte text editor only to execute a 10-Kbyte 
command interpreter). In order to preserve the lightweight nature of UNIX processes, research systems 
implemented "copy-on-write" fork(), where only the pages actually modified would be copied. (All pages 
in the copy are marked "read-only;" as write operations are detected and faulted, the affected pages are 
copied and marked "read-write.") Copy-on-write is now a de facto standard in UNIX systems. Alas, 
copy-on-write required duplication of page table and other bookkeeping information, so it was not 
considered "lightweight enough" for some. 
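
The copy-on-write idea can be sketched with reference-counted pages. What follows is a user-space toy, not how any particular UNIX implements it: the cow_page and cow_space structures, the tiny page size, and the use of a reference count in place of hardware read-only faults are all assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define COW_PAGE_SIZE 16   /* toy page size, far smaller than a real page */
#define COW_NPAGES    4

struct cow_page {
    int refs;                     /* address spaces sharing this page */
    char data[COW_PAGE_SIZE];
};

/* Hypothetical "address space": just a table of page pointers. */
struct cow_space {
    struct cow_page *page[COW_NPAGES];
};

/* fork: share every page; only the page table itself is duplicated. */
void cow_fork(const struct cow_space *parent, struct cow_space *child)
{
    for (int i = 0; i < COW_NPAGES; i++) {
        child->page[i] = parent->page[i];
        if (child->page[i] != NULL)
            child->page[i]->refs++;
    }
}

/*
 * Write one byte. A shared page is copied first (this models the write fault).
 * Returns 1 if a copy was made, 0 if the page was already private, -1 on error.
 */
int cow_write(struct cow_space *as, int pageno, int off, char value)
{
    struct cow_page *pg = as->page[pageno];
    int copied = 0;

    if (pg == NULL)
        return -1;
    if (pg->refs > 1) {
        struct cow_page *copy = malloc(sizeof *copy);
        if (copy == NULL)
            return -1;
        memcpy(copy->data, pg->data, COW_PAGE_SIZE);
        copy->refs = 1;
        pg->refs--;               /* the original loses one sharer */
        as->page[pageno] = copy;
        pg = copy;
        copied = 1;
    }
    pg->data[off] = value;
    return copied;
}
```

Note that cow_fork() still walks and duplicates the whole page table even though no page data is copied, which is exactly the bookkeeping cost mentioned above.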

One extreme solution was a mechanism called vfork() — it literally stole the parent process’s address space 
en masse and used it temporarily until it executed another program, and thus avoided copying page tables 
and bookkeeping information. This sounds wonderful until you realize that the program using vfork() has 
to be carefully written to avoid inadvertently modifying the parent process (because both child and parent 
process run using exactly the same memory). Also, due to the weird semantics of vfork() (you must clean 
up after yourself, the child process always runs first, and so on), it’s not a general-purpose replacement for 
fork(). (In fact, in many cases you can’t employ it.) Finally, it’s not easy to debug, because a single 
program is "running" in two places. However, it still remains the cheapest way to spawn processes, even if 
it is a bit ugly and cumbersome. 

In the never-ending search for lighter-weight mechanisms, "threads" next come into view. Exactly what a 
thread is, however, varies from one system to another. Some view threads simply as tasks within a 
process. Others view them as Lightweight Processes (LWPs) that may share address space. Minimally, a 
thread must have a separately executing PC (program counter) or instruction pointer on the 386, although 
to be practical, we also suggest a stack. Hopefully, the cost of creation and context switching would then 
be so low that we could program in terms of thousands of threads. Typically, they would be used as ways 
for normally dormant functions to become active (by being scheduled to run), such as when an exception 
needs to be processed. Threads are also blockable. 





Threads seem at first to provide a natural way to explore parallel processing or multiprocessing within a 
process, because you can now have a thread per processor. On second glance, however, threads aren’t an 
answer to the hard questions of parallel processing. In particular, for the past 30-40 years we have been 
working with sequential mechanisms. Now how do we delegate work to parallel instruction streams? In 
other words, how do we "think parallel"? Threads may be part of the answer, but, contrary to the popular 
literature, they are far from all of it. 

Leveraging the early idea that programs were in processes and could be used in each command, 
conventional UNIX multiprogramming permitted you to build complicated commands easily and tie them 
together in scripts and pipelines; this was valuable. 

However, at the risk of sounding skeptical, it is not clear that threads offer any immediate advantage on 
either a uniprocessor or multiprocessor system over the conventional UNIX multiprogramming with 
processes. To truly make use of threads, our program development tools (compilers, linkers, debuggers) 
must provide even more functionality than before. This is especially true with the debugger, as you need to 
know just what thread modified which global variable that caused another to generate an exception. 
Otherwise, multithread programming might resemble an impossible can of worms, subject more to the 
rules of black magic than professional practice. 

Watching this "slimming down" of processes, we can only speculate about what will occur next if this 
trend continues. Can we next expect subinstruction parallelization, or multi-threaded microcode? If so, it 
might fit in with the trend toward superscalar processors, where multiple operations are performed per 
each clock tick (that is, a 50-MHz processor that does hundreds of millions of operations per second). 
Perhaps programs will then be written in a two-dimensional hierarchical arrangement, and separately 
address parallel iteration in hardware and sequential iteration in time. We might end up with the neural net 
arrangement, itself an example of both time and space iteration. Whatever else occurs, the trend toward 
parallel execution will definitely shape the programming environments of the future. 


— B.J. and L.J. 

Brief Notes: Is UNIX Real-Time Enough? 

Our UNIX system model, with its 1 to 100 millisecond response time, seems more than adequate for its 
time-sharing role. However, we seem to be moving into a world where the elements of what is loosely 
described as "multimedia," namely voice, imagery, and other sensory information, are becoming more 
commonplace. These sources of information not only require a lot of I/O bandwidth to make it onto and off 
of our system, but they also may require microsecond or perhaps even nanosecond response time (as for 
video). Does UNIX work well enough for the multimedia and networking of the future, or is it showing its 
age? 

On the surface, there are a number of problems. By the time a process is activated, a voice response event of 
brief duration (such as saying the words "yes" or "no") may have come and gone unnoticed. Software 
delays can amount to substantial portions of time, so we lean towards customized hardware solutions as 
the only predictable way of guaranteed response. 





This is not the end of the story, or the end of UNIX, however, as many elements of multimedia have 
already been demonstrated on extant UNIX systems sans customized hardware. Clever software can 
"buffer ahead" in order to make up for the occasional delay that mars the synchronous "playback" of 
information. The 1991 Summer USENIX presented several examples in the areas of music and video 
which were really quite fun to watch and work with. 

However, other problems remain, especially that of bandwidth. One (relatively) quick way around this is to 
develop new protocols that synchronize and reserve bandwidth on a network. For casual near-term needs, 
UNIX can function adequately in this way. 

Video, however, presents an even greater problem. Even when we utilize modern data compression 
techniques, current data rates push hardware and software to their limit. Video grows even more 
intractable when we start to consider future video (HDTV) bandwidths. Finally, if we want our software to 
transform the video in real time, we can't work with a compressed signal. For such applications, real-time 
systems may be required. The question now becomes, can our workstations serve all needs well? 

Real-time systems differ from time-sharing systems in that they tend to have their controls "exposed" 
instead of hidden. Scheduling is often controlled by a group of tasks that cooperate, and arbitrary 
preemption is the rule, not the exception. UNIX primitives don't always fit nicely into this world, and 
most existing real-time systems have different proprietary interfaces, complicating matters even further. 
POSIX has people working on real-time standards, so there may be hope for a standard interface yet, but 
not soon. As a way around the problem, a system could conceivably work both interfaces at once, by 
building mechanisms on top of UNIX mechanisms (this has been done by some manufacturers), or we 
could rewrite UNIX to provide guaranteed real-time response (as has been done by other manufacturers). 

In the long run, the conflict over real-time operation versus transparent time-sharing may be an intractable 
problem which only a successor to UNIX may correct. For the short-term, systems which appear to "mix" 
real-time and time-sharing UNIX (like those systems with separate dedicated processors) will probably 
suffice, but the economics of the marketplace will eventually demand a less costly or less complicated 
solution. 


— B.J. and L.J. 

Porting Unix To The 386: The Basic Kernel: Multiprogramming 
and Multitasking, Part II 

How multiprogramming is achieved via multitasking. A discussion of the process. Alternative 
implementations and trade-offs. A reflection on why it has been so difficult to add multiprogramming to 
non-UNIX operating systems such as MS-DOS. 

Multiprogramming and Multitasking, Part II 





William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and the chief architect of National Semiconductor’s 
GENIX project, the first virtual memory microprocessor-based UNIX system. Prior to establishing 
TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric Computer 
Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or comments to 
lynne@berkeley.edu. (c) 1991 TeleMuse. 

In last month’s installment, we embarked upon an exploration of multiprogramming and multitasking — 
two of the important elements which help make UNIX "UNIX." Now that we have examined these key 
UNIX conventions and have developed the intellectual framework for multiprogramming, we will 
proceed to an examination of some actual code, in particular sleep(), wakeup(), and swtch(), and how 
these three routines carry off the illusion of multiple simultaneous process execution on a sole processor. 

Following our discussion of the 386BSD switching mechanisms, we will then discuss some of the 
requirements for the extensions needed to support multiprocessor and multithread operations in the 
monolithic 386BSD kernel. Finally, we will reexamine the multiprogramming attempts of some other 
operating systems in light of what we have learned. 

The 386BSD swtch() Routine 

The process context switching routine in our 386BSD kernel that provides the actual functionality is called 
swtch(). In a nutshell, swtch() stores the process’s context in a pcb (process context block) data structure, 
finds the next process to run from those waiting to be run, loads the new process’s state, and then executes 
a return. Note that this return completes the call to swtch() that the new process itself made the last time 
it was running. 

This mechanism relies on the fact that a process must effectively request a context switch to the next 
consecutive process, whatever it is. This way, either another process is run, or, if it is the only running 
process, it switches back to itself again. If there is no process to run, such as when the sole running process 
switches while waiting for disk I/O to complete, swtch() must idle the processor and rescan the run queue 
when a process is added to the queue. 

The 386BSD swtch() routine attempts to avoid work when possible, because at times, literally hundreds to 
thousands of switches per second may be demanded as we use a large number of processes running on top 
of the kernel. To reduce overhead, swtch() does nothing to ensure that the data structures remain 
unchanged between the call and subsequent return from swtch(). As such, other parts of the kernel must 
incorporate mechanisms concerned with critical sections (that we can implement as needed at those sites). 
Similarly, nothing in swtch() is done to prevent deadlocks or even ensure that a process is ever executed 
again! (MULTICS, in contrast, has refined to a fine art the ability to avoid deadlocks.) By reducing the 
overhead on context switching, we also reduce the cost of a process. 

Other operating systems attempt to reduce the overhead on process switching by avoiding switches when 
possible and planning for them when forced to use them. The designers of UNIX took a different approach 
— make the mechanism for context switching simple and cheap, use it often, and build new mechanisms 
for handling other requirements on an as needed basis. In other words, multitasking comes first, and 
everything else falls within our multitasking framework. 





Where Is swtch() Used? 


swtch() is called only in the "top" layers of the kernel when a process, in kernel mode, blocks for a 
resource. To keep processes that stay in user mode from locking or rescheduling, swtch() is called as an 
exception at the end of interrupt processing, when we would be returning to the user-mode program. A 
periodic clock interrupt (used to recalculate the process priorities used by swtch() to determine who to run 
next) also ensures that a process can’t overstay its welcome even if no other interrupts occur. 

If a process is not running, it is either waiting to return to user mode (for instance, a higher priority process 
preempted it) or its kernel mode alter ego is waiting for some event or resource to "wakeup" so that it can 
be switched. In most UNIX systems, a process status command called ps can be used to differentiate 
between these two cases. In the first case, a process will be marked as "runnable" (R or SRUN). In the 
second, it will be marked as "sleeping" (S or SSLEEP) and be assigned a "wchan" (or wakeup event 
number) to indicate which event it is waiting for. 

Listing One (page 118) is a code fragment executed after a system call or interrupt, when the process is 
about to return to user mode. At this point, no resources capable of causing a deadlock condition are being 
held, so the process sees no semantic difference between running now or later — it is safe to switch. Note 
that policy decisions of when and who to switch to are made elsewhere. Also note that signals, including 
ones that might result in termination, are processed at this point. 

tsleep(), or "timed sleep" (see Listing Two, page 118), is a blocking call in the BSD kernel that sets a 
process sleeping and switches away to run another process (or idle) until the event occurs. It works by 
marking the process as waiting for an event, inserting it on a sleep queue (frequently many processes sleep 
for the same event), and switching. In addition, tsleep() has options to abort a sleep after a time limit, as 
well as the ability to catch "signals" (that is, someone has killed the process and the code calling tsleep() 
should clean up and let other routines, such as system calls, process the incoming signal). 

Unlike tsleep(), wakeup() (see Listing Three, page 118) does not cause control to pass to another process. 
It simply removes the block (indicated by the "chan" event) from the processes that are sleeping on it. To 
reduce time spent finding processes, the sleep queue is hashed. Found processes are added to the run 
queues. If the process is not loaded, the swap scheduler associated with process 0, the first process, is 
awakened to bring it in. (Of course, in that case we won’t put it on a run queue and reschedule.) Note that it is not an 
error if no processes are waiting for the event. The processes now placed on the run queue will be run (in 
order of priority and in FIFO order) as swtch() is called. 

When swtch() is called by other portions of the kernel, caution must be observed — its "simple" nature 
should not cause more problems than it solves. For example, to forestall deadlocks and data structure 
corruption created by another process that may have been run in the interim, we must make sure that calls 
are "safe." These rules, by the way, are generic to the entire kernel. 

The Life Cycle of a Process 

By now you might have noticed that swtch() relies upon the existence of a current process, and possibly 
even other processes, to be summarily executed. But we still have not discussed how processes come into 
being and how they cease to exist — in other words, the life cycle of a process. We’ll briefly touch on these 
to complete the picture. 






Processes are created by a fork() system call. The microscopic, inner portion of a fork() call must create a 
process context that can be swtch()ed to so that it can be run. This process context is created by a 
cpu_fork() routine. It builds the process context into the newly allocated process data structure, then adds 
it to the list of executable processes. 

Process termination is conducted by a similarly named routine, cpu_exit(). This routine, the inner part of 
the exit() system call, deallocates a process and its related data structures, and then switches away. The 
process no longer exists, so it never runs the risk of being switched back. This is also true for "killed" 
processes, because exit() is called in the course of processing a signal. Exiting UNIX processes are, in 
effect, suicides. 

For those of you wondering why an apparently "unkillable" process occurred (a not uncommon 
occurrence), this is because the process "sees" a termination signal only when it returns to user mode. If 
the process is blocked waiting for an event or a resource, and the blocking mechanism does not notice the 
existence of the signal, it will continue to wait in vain for that resource. This problem usually arises from a 
programming error in a device driver or as an unintended deadlock that a new feature in the system just 
provoked. 

The Magic of swtch(): a Simple Scenario 

Now that we’ve outlined what swtch() does, we need to know how it brings off the illusion of 
multiprogramming — the magic, if you will. As a working example, we will now set up a hypothetical 
session: Three people, each running a process (each on a terminal), are typing simultaneously. 

User A presses a key. This causes the computer to interrupt User C’s read() system call that was reading a 
file off the disk. User A’s key press is processed by the system and put on a character queue associated 
with User A’s process. A wakeup is then issued to this process, and the computer, in clearing the interrupt, 
returns to User C’s read() system call. 

User C’s process, in kernel mode, requests a block from the disk. It now blocks waiting for the disk, and 
switches. However, User B had a disk I/O process waiting even before this example began (yes, an 
argument for preexisting universes) and is now competing with User C’s process. 

User B’s process now becomes the current running process because it has more priority than User A’s 
process. User B’s process runs, returning to user mode and continuing the user program until its timeslice 
is used up. (The rescheduling clock interrupt routine periodically checks the timeslice as it monitors the 
passage of time, typically 60-100 times a second.) On the last clock interrupt, the rescheduling code sets 
the wantresched flag, and a call to swtch() occurs just before the return to user mode. User A’s process is 
then selected to run. It then processes the received character in the top half of the system, completes its 
system call, and returns to the user. 
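
The division of labor in this scenario, where the clock interrupt only requests a switch and the switch itself happens on the way back to user mode, can be modeled as follows. This is an illustrative sketch and not the kernel's code: the wantresched flag is named in the text, but ts_start_slice, ts_clock_tick, ts_return_to_user, and the switch counter are hypothetical.

```c
static int ts_wantresched;   /* models the wantresched flag */
static int ts_slice_left;    /* clock ticks left in the current timeslice */
static int ts_switches;      /* how many times a switch would have occurred */

/* Give the current process a fresh timeslice of `ticks` clock ticks. */
void ts_start_slice(int ticks)
{
    ts_slice_left = ticks;
    ts_wantresched = 0;
}

/* Clock interrupt: it never switches by itself, it only raises the flag. */
void ts_clock_tick(void)
{
    if (ts_slice_left > 0 && --ts_slice_left == 0)
        ts_wantresched = 1;
}

/*
 * Run just before returning to user mode, when no kernel resources are held.
 * Returns 1 if a context switch would take place here, 0 otherwise.
 */
int ts_return_to_user(void)
{
    if (!ts_wantresched)
        return 0;
    ts_wantresched = 0;
    ts_switches++;           /* stands in for the call to swtch() */
    return 1;
}

int ts_switch_count(void)
{
    return ts_switches;
}
```

Deferring the switch to this one safe point is what lets the interrupt path stay short: the tick handler touches only a counter and a flag.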

This entire scenario took place in the order of a hundredth of a second. Unlike the audiophile who claims 
to dislike CDs because "he can hear the sound breaking up" 44,000 times a second (good ears!), most 
mortals can’t resolve time this finely. However, if 70 users on an anemic little machine such as a 
PDP-11/70 all hit return at the same time, this tolerable hundredth-of-a-second delay would grow to 
encompass seconds. In fact, you could go get an espresso while running a simple compilation and get back 
in time for a prompt. (Except that someone would probably steal your terminal in the meantime. Yes, it 
used to be this way.) 





This should illustrate the amount of work done by this tiny subroutine. As such, it’s time to dissect the 
innards of the 386BSD swtch() routine. 

A Closer Look at swtch() 

For all practical purposes, swtch() boils down to three functions: store, select, and load. We store the 
current process’s state, select the new process to be run, and load in the new process’s state. 

Storing and loading processor state is pretty simple (see Listing Four, page 118). All we really need to do 
is move processor registers appropriately and get a new address space. The key global variable within our 
BSD kernel is the curproc variable; it points to the active process at all times. From the process data 
structure (which acts as a directory of data structures for this process), we obtain from the p_addr field the 
address of the pcb to hold the context we will store and load. One element of that context, the page table 
base pointer %cr3 of the process, defines the address space of the user process page tables. 

Floating point is dealt with here by setting a bit in a processor special register, %cr0, which causes a trap 
on the next floating-point operation. In this way, we can delay the store and unload of the coprocessor 
until it’s absolutely needed. Storing and loading the "Numeric Coprocessor Extension" (NPX) is costly, 
and usually only one or two processes use it at a time, so deferring the work minimizes the impact on the remaining processes. 
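
To see why deferring the coprocessor save pays off, consider this user-space model of the scheme. It is a sketch under stated assumptions rather than the 386BSD code: a single double stands in for the whole coprocessor state, fp_ts_flag models the trap-on-next-use bit in %cr0, and all names (fp_proc, fp_switch_to, fp_add, and so on) are hypothetical.

```c
#include <stddef.h>

struct fp_proc {
    double saved_fp;              /* stand-in for saved coprocessor state */
};

static double fp_reg;             /* stand-in for the live coprocessor registers */
static struct fp_proc *fp_owner;  /* process whose state is live in the coprocessor */
static struct fp_proc *fp_cur;    /* currently running process */
static int fp_ts_flag;            /* models the "trap on next FP use" bit */
static int fp_swaps;              /* times state was actually stored and loaded */

/* Context switch: set the flag and leave the coprocessor untouched. */
void fp_switch_to(struct fp_proc *next)
{
    fp_cur = next;
    fp_ts_flag = 1;
}

/* Models the trap taken on the first FP instruction after a switch. */
static void fp_trap(void)
{
    fp_ts_flag = 0;
    if (fp_owner == fp_cur)
        return;                       /* live state is already ours */
    if (fp_owner != NULL)
        fp_owner->saved_fp = fp_reg;  /* store the previous owner's state */
    fp_reg = fp_cur->saved_fp;        /* load ours */
    fp_owner = fp_cur;
    fp_swaps++;
}

/* A floating-point operation by the current process: add x to the register. */
double fp_add(double x)
{
    if (fp_ts_flag)
        fp_trap();
    fp_reg += x;
    return fp_reg;
}

int fp_swap_count(void)
{
    return fp_swaps;
}
```

In this model, a process that never executes a floating-point operation never causes a store or load, no matter how many times it is switched in and out between two floating-point users.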

For the system hardware, we have to remember the priority level of hardware interrupts (the cpl) and 
restore it as well. Naturally, we must also set curproc to point to the successor process. 

The mechanism for selecting the new process is more complicated. In 386BSD we have 32 run queues of 
ascending priorities. swtch() is a consumer of the run queues, taking the leading process off the highest 
priority run queue and loading it. It will consume all processes off a high priority queue before consuming 
any from a lower queue. To quickly find which queue to consume first, the queue status variable whichqs 
records the "filled versus empty" status of all queues (32 queues corresponding to 32 bits, one bit per 
queue). With a single bit-scan instruction, we can determine: 1. whether a process is on any queue at all (if 
whichqs is 0, there is no process to run); and 2. which is the highest priority queue that has a process in it. 
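
The bitmask lookup can be sketched in portable C. The variable name whichqs comes from the text; everything else here is assumed for illustration, including the helper names and the convention that bit 0 corresponds to the highest priority queue (so that a scan from the low-order bit finds the best queue first). A loop stands in for the single bit-scan instruction.

```c
#include <stdint.h>

#define NQS 32                 /* number of run queues, one bit each */

static uint32_t whichqs;       /* bit q set <=> run queue q is non-empty */

void mark_queue_filled(int q)
{
    whichqs |= (uint32_t)1 << q;
}

void mark_queue_empty(int q)
{
    whichqs &= ~((uint32_t)1 << q);
}

/*
 * Returns the index of the first non-empty queue, scanning from bit 0,
 * or -1 if every queue is empty (the caller would then idle).
 */
int pick_queue(void)
{
    if (whichqs == 0)
        return -1;
    for (int q = 0; q < NQS; q++) {
        if (whichqs & ((uint32_t)1 << q))
            return q;
    }
    return -1;                 /* not reached when whichqs != 0 */
}
```

Both questions are answered from one 32-bit word: a zero test says whether anything is runnable at all, and the position of the first set bit says where to look.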

What if we can’t find a process to run? We must "idle," awaiting a process that can be serviced. In our idle 
loop, we reenable interrupts and execute the hlt instruction to stop the processor. In reality, the hlt 
instruction pauses until the next interrupt. We actually could do other things at this point (some systems, 
for example, scrub dirty pages), but we choose to idle the processor in an (usually vain) attempt to avoid 
stealing bandwidth from the bus (ah...but ISA has no other bus masters or other CPUs), or, in the case of a 
CMOS 386 in a laptop, save power (almost none are "low-power" CMOS). 

In order to be run for a time from swtch(), processes must first be queued on a run queue. The setrq() 
routine (see Listing Five, page 120) places them on the end or tail of the run queue associated with the 
process’s priority. Priority may change as a process runs, so we have a remrq() as well (see Listing Five) 
to remove a process from a run queue. This allows us to reinsert the process with setrq() at a different 
priority, and thus, a different run queue. 
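
The queue manipulation just described might look like the following in plain C. This is not Listing Five: the toy_ prefix marks every function as hypothetical, the doubly linked layout is an assumption, and the pri field is taken to be the run queue index directly.

```c
#include <stddef.h>
#include <stdint.h>

#define RQ_NQS 32

struct rq_proc {
    struct rq_proc *next, *prev;
    int pri;                      /* assumed: run queue index, 0..RQ_NQS-1 */
};

struct rq_head {
    struct rq_proc *first, *last;
};

static struct rq_head rq_qs[RQ_NQS];
static uint32_t rq_whichqs;       /* bit q set <=> queue q is non-empty */

/* Append p to the tail of the run queue for its priority. */
void toy_setrq(struct rq_proc *p)
{
    struct rq_head *h = &rq_qs[p->pri];

    p->next = NULL;
    p->prev = h->last;
    if (h->last != NULL)
        h->last->next = p;
    else
        h->first = p;
    h->last = p;
    rq_whichqs |= (uint32_t)1 << p->pri;
}

/* Unlink p from its run queue, clearing the status bit if the queue empties. */
void toy_remrq(struct rq_proc *p)
{
    struct rq_head *h = &rq_qs[p->pri];

    if (p->prev != NULL)
        p->prev->next = p->next;
    else
        h->first = p->next;
    if (p->next != NULL)
        p->next->prev = p->prev;
    else
        h->last = p->prev;
    p->next = p->prev = NULL;
    if (h->first == NULL)
        rq_whichqs &= ~((uint32_t)1 << p->pri);
}

struct rq_proc *toy_rq_first(int q)
{
    return rq_qs[q].first;
}

uint32_t toy_rq_status(void)
{
    return rq_whichqs;
}
```

Changing a queued process's priority is then a three-step affair: toy_remrq() it, update pri, and toy_setrq() it again so that it lands at the tail of its new queue.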






Alternative Implementations and Trade-offs 

There are a number of other ways to implement this process switching routine. On the 386, at least three 
ways of loading and unloading the registers provide different degrees of functionality at a given cost. 
Also, we can choose to reorder the structure of this routine to minimize the costs for common cases 
(switch to self and switch to idle, for instance). 


Instead of storing and loading the registers with individual move instructions, we could have used a single 
pushal instruction. In that case, we would point the stack at the pcb before issuing the instruction, then 
restore it afterward. Something below the pcb might be written on if we got an interrupt, so we would also 
have to bracket it with instructions to lock out interrupts and reenable them. 

Another way we could store and load registers on the 386 is to use the all-encompassing JUMP TSS task 
switch instruction (ljmp to a TSS descriptor). This instruction, unique to the 386, stores all register state in 
a special data structure and loads new state as well from the new process. It even switches the page tables 
and sets the TS bit dealing with the coprocessor. (Does just about everything but walk the dog!) 

To use the JUMP TSS instruction, we would need a TSS descriptor for each process that could be active at 
any time. Also, we would have to detect the case where we might be switching back onto ourselves, and 
avoid using JMP TSS at this point. (This instruction is so helpful, it even tracks whether we are using the 
very TSS loaded, and gives us an exception if we do that!) This instruction, true to its calling, really does 
it all, storing and loading all registers (sans the coprocessor, thank goodness) all the time. 

RISC fanatics tend to have a field day with CISC instructions of this kind because of the complexity and 
expense. As a matter of fact, while comprehensive, we find it too slow to use efficiently for our purposes. 
However, to be fair to the designers of the 386, if you need to do all of the things that JUMP TSS offers, 
using this instruction is probably your best bet. (At the moment, 386BSD doesn't really need all this 
instruction’s features, but our pcb format allows us to use JUMP TSS in subsequent versions, should it 
become more desirable.) The 386 also supports exceptions and interrupts optionally handled with a 
transparent CALL TSS, an instruction which may also offer some advantage in certain circumstances. 


In comparing our three methods outlined for process switching routines, it is important to sit down and 
add up the instruction costs. We know that by saving fewer registers, we end up doing fewer loads and 
stores, and hence make our end-to-end cost lower. In our 386BSD swtch() function (see Listing Four), 
we get away with saving only six registers. We don't need to save %eax, %edx, and %ecx because these 
are compiler temporary registers which are discarded on return. We also don’t save the segment registers 
because they don't change in this version of the system. In contrast, pushal saves eight registers and JMP 
TSS saves 20. Adding up the instruction costs, our approach is the best of the three. 


We can also look at structural changes to swtch() itself. For example, instead of our (store, select, load) 
sequence we could try a (select, store, load) sequence. In this example, if we detect the case of switching 
to ourselves (perhaps after an idle wait), we can avoid both the store and load of registers. (This is 
particularly useful if you have a RISC with 30-100 registers or more.) We actually coded the first version 
this way and ran the operating system like this for a year. But due to the paucity of registers on the 386, 
little advantage was gained. 





The simpler arrangement seems to work better, because with (store, select, load), we have more registers 
free to keep the values used by select, thus reducing the number of memory accesses. Also, there is usually 
a delay between when the processor idles and when an interrupt occurs (which will cause a wakeup() and 
then a setrq()), so the store occurs during this delay time. This in turn makes the delay shorter because 
only the load would need to be completed to finish the swtch()! Thus, the average real time elapsed 
would be shorter, because the store is overlapped with the wait! 

Another change one could make is to move the select portion to the setrq() function, and rely on a single 
comparison to determine if a switch or idle needs to be done (having already pre-calculated which one it 
is). But this adds to the complexity, and might not result in a gain, because the places that matter have a 
setrq() call just prior to swtch(). It might even be slower, because there are more setrq() calls than 
swtch() calls. Sometimes clever optimizations just move the problem around rather than improve the 
situation. 

Multiprocessing 

Multiprocessing describes a system capable of managing multiple processors. (It does not mean running 
multiple processes, which is called "multiprogramming.") UNIX kernel paradigms we have mentioned are 
extensible to multiprocessors (with semaphores and effort), because many of the problems we’ve dealt 
with (serialization, deadlock prevention, blocking, and context switching) apply here as well, but on a 
grander scale. With multiprocessors, there’s obviously even more competition for shared data structures, 
as each processor may want the very same data object at the same moment in time. 

The degree to which multiprocessors can be applied is often confusing. In the common case, multiple 
processors can each do a process at a time — parallelism exists at the process level. Such UNIX systems 
have "make" programs that can run multiple, simultaneous compilations. A smaller set of systems 
(including some MACH and Chorus systems) also permit multiple processors to each do a part of a 
process simultaneously. This is a form of "fine grain" parallelism. The mechanisms of multiprocessor 
operation within a single process are facilitated by either threads or lightweight processes. 

Threads are "nanotasks" — in other words, the smallest possible state living in address space. They are 
inexpensive mechanisms added to existing processes. However, most thread programming models use 
different primitives for dealing with threads than for processes. 

Lightweight processes (LWP) are "nanoprocesses" that may share all, or a portion, of an address space. 
They are treated by the system just like processes, but share resources, so they are "lighter" than UNIX 
processes. Lightweight processes use similar or identical primitives to deal with processes in general. 

If you are beginning to notice that the disagreement between these two approaches is one of either being 
"inside out" instead of "outside in," you’re dead on the money. Reading Swift is an excellent exercise for 
those unclear on such conflicts, as the residents of Lilliput and Blefuscu well know! 

Adding Multiprocessing to 386BSD 

Current versions of 386BSD have the sole goal of running on a uniprocessor machine, but that’s not to say 
it will always be this way. Should we wish to extend it to multiple processors, we would need to consider 
a number of issues. 





386BSD is a "monolithic" kernel: All its functionality is built in a single program known as the kernel. On 
a uniprocessor, this kernel program is multitasking, in a sense, to provide multiple processes with system 
call functionality. On a multiprocessor, the same program would be present, but would have to support 
multiprocessing as well. 

Among the most significant modifications to 386BSD would be changes to the subroutines that block, 
unblock, and select processes, so that they serialize access to process state. System call requests from 
processes in 386BSD are always processed with a process pointer, so the state within the process structure 
and the process’s kernel stack would be uniquely accessed by that very processor. Other objects, such as 
the corresponding user process’s address space, files, file descriptors, buffers, and the like, would then be 
arbitrated for by a series of "spin" locks and reference-checked on allocation and deletion. Much of this 
has already been anticipated in the current version of the system, unlike previous editions of BSD. 
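
As a rough illustration of arbitration by spin lock combined with reference counting, here is a sketch in modern C11. This is an assumption on our part for the sake of a runnable example: nothing like <stdatomic.h> was available to the 386BSD kernel, and the names spinlock, shared_obj, obj_init, obj_hold, and obj_release are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A minimal test-and-set spin lock. */
struct spinlock {
    atomic_flag held;
};

static void spin_lock(struct spinlock *l)
{
    /* Busy-wait until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
}

static void spin_unlock(struct spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* A shared kernel object (a file or buffer, say) guarded by the lock. */
struct shared_obj {
    struct spinlock lock;
    int             refcnt;     /* number of holders */
};

/* The allocator holds the first reference. */
static void obj_init(struct shared_obj *o)
{
    atomic_flag_clear(&o->lock.held);
    o->refcnt = 1;
}

/* Take an additional reference. */
static void obj_hold(struct shared_obj *o)
{
    spin_lock(&o->lock);
    o->refcnt++;
    spin_unlock(&o->lock);
}

/* Drop a reference; returns true when the caller released the last one
 * and is therefore responsible for freeing the object. */
static bool obj_release(struct shared_obj *o)
{
    bool last;

    spin_lock(&o->lock);
    last = (--o->refcnt == 0);
    spin_unlock(&o->lock);
    return last;
}
```

The reference check on deletion is the point of interest: only the processor that drops the count to zero may tear the object down, so no other processor can be left holding a pointer to freed state.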

How might such extensions allow multiprocessing inside a single process? Well, processes could share all 
or part of an address space, so they could become lightweight by not requiring a complete copy of all of a 
process’s parts, for example, vfork(). (See the sidebar entitled "Brief Notes: Lightweight Processes and 
Threads" in the September issue.) By virtue of "gang" scheduling, a set of shared processes could all 
become active, each individual one per processor. Debugging such an arrangement would be identical in 
form to multiple process debugging. 

Reflection: Why is it Hard to Add Multiprogramming After the Fact? 

UNIX has been around for over 20 years and predates other operating systems which have come (and 
gone) such as CP/M, MS-DOS, and Finder. So, why has multiprogramming come to mass market 
platforms such as the PC and Mac so late? Both MS-DOS and Finder were written with some 
multiprogramming "writing on the wall" in mind, but it’s been a long road between "it’s going to be there" 
and "it’s here," and it’s not common yet. 

Windows/386 accomplishes some aspects of multiprogramming (somewhat like UNIX) through a 
hardware trick: by simulating PCs via the virtual 8086 mode, with the actual Windows kernel running in 
protected mode. Multifinder (System 7.0) on the Macintosh also attempts multiprogramming, but the price 
is that no safety nets are held out for naughty programs. (In other words, programs were doing things they 
should never have been doing, so they should be changed to work appropriately. This does not go over 
well on the applications circuit, to say the least.) 

What was the problem? The experiences of the past were not heeded in many areas, and an appropriate 
model was not completely thought out. (In addition, the cost of an MMU was considered "too high" for the 
PC and Mac.) 

• These systems did not separate the application program from the kernel, as UNIX has done since its 
early days on the PDP-11/45. This meant that missing functionality in the system could be bestowed 
by a clever applications program, but there was a downside. These applications got far more intimate 
with the operating system’s internals than its designers probably desired. It’s hard to believe that 
MS-DOS redirectors and TSRs were anything but a bad dream to such designers, who had different 
plans for the future. 

• The drivers and other portions of the system were written expecting synchronous operation without 
preemption by the system. This precludes multiprogramming, because the system can’t run another 





program in the idle time waiting for the disk. (Early versions of UNIX suffered this flaw as well.) To 
put this in UNIX terminology, the top half of the system blurred together with the bottom half, because it 
didn’t matter with nonmultitasking systems such as MS-DOS and Finder. 

In general, the lessons learned from UNIX (and other operating systems) are often ignored by operating 
systems designers. The desire to "create from the bottom up" can result in short-sighted and incomplete 
designs which are difficult to rectify later. A good operating systems designer should attempt to leverage 
as much as possible from other efforts, sorting out the good from the bad, and then proceeding onwards 
with a new set of goals. After all, we don’t go ahead and design the microprocessor, design and build the 
hardware, port the operating system, and then write the applications programs all at once, do we? (Well, 
we wouldn’t recommend this route, but we have actually done three of the four, and it was not easy.) 

Remember also that shortcuts that appear not to matter often come back to bite, so trade-offs should be 
carefully hashed out before a decision is made. That’s why we discuss why we didn’t do something as 
well as what we did do in 386BSD. These rules are pertinent to all operating system design, not just 
UNIX. 

After understanding the broad implications of multiprogramming, along with the minutiae which make it 
possible, it’s impressive that it’s all based on a simple set of conventions. Through a careful understanding 
of these conventions and how they are implemented, we gain appreciation for how a simple model, 
carefully arranged, can offer much years down the line. 

Onward and Forward 

The mechanics of processes and context switching, coupled with a basic understanding of 
multiprogramming, multiprocessing, and multitasking, form some of the key components of a true UNIX 
system. These fundamental constructs were incorporated into the original design of UNIX, with the result 
that extending it into the realms of multiprocessing, for example, becomes a plausible goal not buffeted by 
contradictory design elements (as in MS-DOS, for example). Any operating system which purports to be 
multiprogramming must meet the definitions and constructs of a multiprogramming system, else it quite 
simply is not. 

As we stated earlier, we are working on many areas of this 386BSD port at once, so we will be returning 
to our main() procedure (see DDJ August 1991) next month to continue our discussion in more detail and 
focus on the primitives and organization which impact device drivers. In particular, we will examine 
important areas such as auto-configuration, the enabling operation of the PC hardware devices, splX() 
(interrupt vector-level management), and the interrupt vector code. The following month, after having laid 
the groundwork for our UNIX device drivers, we will discuss sample device drivers. In particular, we will 
examine in detail some of the code required for the console, disk, and clock interrupt drivers. The basic 
structure of these drivers, minimal requirements, and extending the functionality through procedures such 
as disklabels will also be discussed. 

[LISTING ONE] 

/* code fragment from i386/trap.c (in trap() and syscall()) */

    if (want_resched) {
        /*
         * Enqueue our current running process first, so
         * that we may eventually run again. Block clock
         * interrupts that may interfere with priority
         * (e.g. we'd rather it not be recalculated part
         * way thru setrun).
         */
        (void) splclock();
        setrq(p);
        (void) splnone();
        p->p_stats->p_ru.ru_nivcsw++;
        swtch();
        while (i = CURSIG(p))
            psig(i);
    }


[LISTING TWO] 

/*-
 * Copyright (c) 1982, 1986, 1990 The Regents of the University of California.
 * Copyright (c) 1991 The Regents of the University of California.
 * All rights reserved.
 */

/*
 * General sleep call.
 * Suspends current process until a wakeup is made on chan.
 * The process will then be made runnable with priority pri.
 * Sleeps at most timo/hz seconds (0 means no timeout).
 * If pri includes PCATCH flag, signals are checked
 * before and after sleeping, else signals are not checked.
 * Returns 0 if awakened, EWOULDBLOCK if the timeout expires.
 * If PCATCH is set and a signal needs to be delivered,
 * ERESTART is returned if the current system call should be restarted
 * if possible, and EINTR is returned if the system call should
 * be interrupted by the signal (return EINTR).
 */
tsleep(chan, pri, wmesg, timo)
    caddr_t chan;
    int pri;
    char *wmesg;
    int timo;
{
    register struct proc *p = curproc;
    register struct slpque *qp;
    register s;
    int sig, catch = pri & PCATCH;
    extern int cold;
    int endtsleep();

    s = splhigh();
    if (cold || panicstr) {
        /*
         * After a panic, or during autoconfiguration,
         * just give interrupts a chance, then just return;
         * don't run any other procs or panic below,
         * in case this is the idle process and already asleep.
         */
        splx(safepri);
        splx(s);
        return (0);
    }

#ifdef DIAGNOSTIC
    if (chan == 0 || p->p_stat != SRUN || p->p_rlink)
        panic("tsleep");
#endif

    p->p_wchan = chan;
    p->p_wmesg = wmesg;
    p->p_slptime = 0;
    p->p_pri = pri & PRIMASK;

    /* Insert onto the tail of a sleep queue list. */
    qp = &slpque[HASH(chan)];
    if (qp->sq_head == 0)
        qp->sq_head = p;
    else
        *qp->sq_tailp = p;
    *(qp->sq_tailp = &p->p_link) = 0;

    /*
     * If time limit to sleep, schedule a timeout
     */
    if (timo)
        timeout(endtsleep, (caddr_t)p, timo);

    /* We put ourselves on the sleep queue and start our timeout
     * before calling CURSIG, as we could stop there, and a wakeup
     * or a SIGCONT (or both) could occur while we were stopped.
     * A SIGCONT would cause us to be marked as SSLEEP
     * without resuming us, thus we must be ready for sleep
     * when CURSIG is called. If the wakeup happens while we're
     * stopped, p->p_wchan will be 0 upon return from CURSIG.
     */
    if (catch) {
        p->p_flag |= SSINTR;
        if (sig = CURSIG(p)) {
            if (p->p_wchan)
                unsleep(p);
            p->p_stat = SRUN;
            goto resume;
        }
        if (p->p_wchan == 0) {
            catch = 0;
            goto resume;
        }
    }

    /* Set process sleeping, go find another process to run */
    p->p_stat = SSLEEP;
    p->p_stats->p_ru.ru_nvcsw++;
    swtch();

resume:
    splx(s);
    p->p_flag &= ~SSINTR;

    /* cleanup timeout case */
    if (p->p_flag & STIMO) {
        p->p_flag &= ~STIMO;
        if (catch == 0 || sig == 0)
            return (EWOULDBLOCK);
    } else if (timo)
        untimeout(endtsleep, (caddr_t)p);

    /* if signal was caught, return appropriately */
    if (catch && (sig != 0 || (sig = CURSIG(p)))) {
        if (p->p_sigacts->ps_sigintr & sigmask(sig))
            return (EINTR);
        return (ERESTART);
    }
    return (0);
}

[LISTING THREE] 

/*-
 * Copyright (c) 1982, 1986, 1990 The Regents of the University of California.
 * Copyright (c) 1991 The Regents of the University of California.
 * All rights reserved.
 */

/* Wakeup on "chan"; set all processes
 * sleeping on chan to run state.
 */
wakeup(chan)
    register caddr_t chan;
{
    register struct slpque *qp;
    register struct proc *p, **q;
    int s;

    s = splhigh();
    qp = &slpque[HASH(chan)];

restart:
    for (q = &qp->sq_head; p = *q; ) {
#ifdef DIAGNOSTIC
        if (p->p_rlink || p->p_stat != SSLEEP && p->p_stat != SSTOP)
            panic("wakeup");
#endif
        if (p->p_wchan == chan) {
            p->p_wchan = 0;
            *q = p->p_link;
            if (qp->sq_tailp == &p->p_link)
                qp->sq_tailp = q;
            if (p->p_stat == SSLEEP) {
                /* OPTIMIZED INLINE EXPANSION OF setrun(p) */
                if (p->p_slptime > 1)
                    updatepri(p);
                p->p_slptime = 0;
                p->p_stat = SRUN;
                if (p->p_flag & SLOAD)
                    setrq(p);
                /*
                 * Since curpri is a usrpri,
                 * p->p_pri is always better than curpri.
                 */
                if ((p->p_flag&SLOAD) == 0)
                    wakeup((caddr_t)&proc0);
                else
                    need_resched();
                /* END INLINE EXPANSION */
                goto restart;
            }
        } else
            q = &p->p_link;
    }
    splx(s);
}

[LISTING FOUR] 

/* Copyright (c) 1989, 1990, 1991 William Jolitz. All rights reserved.
 * Written by William Jolitz 6/89
 *
 * Redistribution and use in source and binary forms are freely permitted
 * provided that the above copyright notice and attribution and date of work
 * and this paragraph are duplicated in all such forms.
 * THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
 * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
 */

/* Swtch() */
ENTRY(swtch)
        incl    _cnt+V_SWTCH

        /* switch to new process, first, save context as needed */
        movl    _curproc,%ecx
        movl    P_ADDR(%ecx),%ecx

        /* unload processor registers, we need to use them */
        movl    (%esp),%eax
        movl    %eax,PCB_EIP(%ecx)
        movl    %ebx,PCB_EBX(%ecx)
        movl    %esp,PCB_ESP(%ecx)
        movl    %ebp,PCB_EBP(%ecx)
        movl    %esi,PCB_ESI(%ecx)
        movl    %edi,PCB_EDI(%ecx)

        /* save system related details */
        movl    $0,_CMAP2               /* blast temporary map PTE */
        movw    _cpl,%ax
        movw    %ax,PCB_IML(%ecx)       /* save ipl */

        /* save is done, now choose a new process or idle */
rescanfromidle:
        movl    _whichqs,%edi
2:
        bsfl    %edi,%eax               /* found a full queue? */
        jz      idle                    /* if nothing, idle waiting for some */

        /* we have a queue with something in it */
        btrl    %eax,%edi               /* clear queue full status */
        jnb     2b                      /* if it was clear, look for another */
        movl    %eax,%ebx               /* save which one we are using */

        /* obtain the run queue header */
        shll    $3,%eax
        addl    $_qs,%eax
        movl    %eax,%esi

#ifdef DIAGNOSTIC
        /* queue was promised to have a process in it */
        cmpl    P_LINK(%eax),%eax       /* linked to self? (e.g. not on list) */
        je      panicswtch              /* not possible */
#endif

        /* unlink from front of process q */
        movl    P_LINK(%eax),%ecx
        movl    P_LINK(%ecx),%edx
        movl    %edx,P_LINK(%eax)
        movl    P_RLINK(%ecx),%eax
        movl    %eax,P_RLINK(%edx)

        /* is the queue truly empty? */
        cmpl    P_LINK(%ecx),%esi
        je      3f
        btsl    %ebx,%edi               /* nope, set to indicate full */
3:
        movl    %edi,_whichqs           /* update queue status */

        /* notify system we've rescheduled */
        movl    $0,%eax
        movl    %eax,_want_resched

#ifdef DIAGNOSTIC
        /* process was ensured to be runnable, not sleeping */
        cmpl    %eax,P_WCHAN(%ecx)
        jne     panicswtch
        cmpb    $SRUN,P_STAT(%ecx)
        jne     panicswtch
#endif

        /* isolate process from run queues */
        movl    %eax,P_RLINK(%ecx)

        /* record details of newproc in our global variables */
        movl    %ecx,_curproc
        movl    P_ADDR(%ecx),%edx
        movl    %edx,_curpcb
        movl    PCB_CR3(%edx),%ebx

        /* switch address space */
        movl    %ebx,%cr3

        /* restore context */
        movl    PCB_EBX(%edx),%ebx
        movl    PCB_ESP(%edx),%esp
        movl    PCB_EBP(%edx),%ebp
        movl    PCB_ESI(%edx),%esi
        movl    PCB_EDI(%edx),%edi
        movl    PCB_EIP(%edx),%eax
        movl    %eax,(%esp)

#ifdef NPX
        /* npx will interrupt next instruction, delay npx switch till then */
#define CR0_TS 0x08
        movl    %cr0,%eax
        orb     $CR0_TS,%al             /* disable it */
        movl    %eax,%cr0
#endif

        /* set priority level we were at last time */
        pushl   PCB_IML(%edx)
        call    _splx
        popl    %eax
        movl    %edx,%eax               /* return (1); (actually, non-zero) */
        ret

/* When no processes are on the runq, Swtch branches to idle
 * to wait for something to come ready.
 */
        .globl  Idle
Idle:
idle:
        call    _spl0
        cmpl    $0,_whichqs
        jne     rescanfromidle
        hlt                             /* wait for interrupt */
        jmp     idle

[LISTING FIVE] 

/* Copyright (c) 1989, 1990 William Jolitz. All rights reserved.
 * Written by William Jolitz 7/91
 * Redistribution and use in source and binary forms are freely permitted
 * provided that the above copyright notice and attribution and date of work
 * and this paragraph are duplicated in all such forms.
 * THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR
 * IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
 * WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
 */

/*
 * Enqueue a process on a run queue. Process will be on a run queue
 * until run for a time slice (swtch()), or removed by remrq().
 * Should only be called with a running process, and with the
 * processor protecting against rescheduling.
 */
setrq(p) struct proc *p; {
    register rqidx;
    struct prochd *ph;
    struct proc *or;

    /* Rescale 256 priority levels to fit into 32 queue headers */
    rqidx = p->p_pri / 4;

#ifdef DIAGNOSTIC
    /* If this process is already linked on run queue, we're in trouble. */
    if (p->p_rlink != 0)
        panic("setrq: already linked");
#endif

    /* Link this process on the appropriate queue tail */
    ph = qs + rqidx;
    p->p_link = (struct proc *)ph;
    or = p->p_rlink = ph->ph_rlink;
    ph->ph_rlink = or->p_link = p;

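    /* Indicate that this queue has at least one process in it */
    whichqs |= (1<<rqidx);
}

/*
 * [The end of setrq() and parts of remrq() were lost in reproduction.
 * The remrq() header, its declarations, the DIAGNOSTIC test and panic
 * string, and the shift operands below are reconstructed from context.]
 */
remrq(p) struct proc *p; {
    register rqidx;
    struct prochd *ph;

    rqidx = p->p_pri / 4;

#ifdef DIAGNOSTIC
    /* If a run queue is empty, something is definitely wrong */
    if ((whichqs & (1<<rqidx)) == 0)
        panic("remrq");
#endif

    p->p_link->p_rlink = p->p_rlink;
    p->p_rlink->p_link = p->p_link;

    /* If something is still present on the queue,
     * set the corresponding bit. Otherwise clear it.
     */
    ph = qs + rqidx;
    if (ph->ph_link == ph)
        whichqs &= ~(1<<rqidx);

    p->p_rlink = (struct proc *) 0;
}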

Porting Unix To The 386: The Basic Kernel: Device autoconfiguration 

How 386BSD discovers hardware devices that are present and configures itself for operation with those 
devices. 

Device autoconfiguration 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, the first virtual memory microprocessor-based UNIX system. Prior to 
establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric 
Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or 
comments to lynne@berkeley.edu. (c) 1991 TeleMuse. 





Last month we examined the mechanics of processes and context switching. Coupled with a basic 
understanding of multiprogramming, multiprocessing, and multitasking (see DDJ, September 1991), we 
have now covered one of the fundamental tenets on which our 386BSD operating systems kernel relies 
and on which everything else is built. With this, we have conquered the "first pitch" of our mountain. 

In essence, we can consider our examination of multiprogramming and multiprocessing and the details of 
swtch() to be analogous to an examination of our map (concepts) and a careful laying of anchors before 
we climb up and over a treacherous overhang. Why an overhang? Because a cavalier approach to these 
basic elements could result in a misdesign which causes a great fall later. Witness the difficulty in getting 
other operating systems to accomplish what UNIX was designed to do from the first. 

However, it is time to make tracks and cover new ground. We are now working on many areas of the 
386BSD port at once, so we must return to our main() procedure (see DDJ, August 1991) and focus on 
the organization and primitives which impact device drivers. In particular, we need to understand the 
concepts necessary to the integration of appropriate device drivers. We examine the UNIX concept of 
"device interface," the layout and terms used in device drivers, and how BSD works the miracle of 
autoconfiguration. We also examine how our BSD kernel interfaces with its device drivers. 

Next month, we will examine actual driver operations. Then, after laying the groundwork for our UNIX 
device drivers, we will discuss some sample device drivers. 

Re-examining Our Framework: Kernel Services 

In our previous articles on machine-dependent (DDJ, July 1991) and machine-independent (DDJ, August 
1991) initialization, you might have noticed that we completely bypassed a significant area — I/O device 
initialization, otherwise known as "automatic configuration" or autoconfiguration. This was done 
intentionally so that we could present a clear introduction to the basic operating arrangement of our BSD 
kernel without gorging on UNIX trivia. 

By describing the basic framework of kernel services prior to I/O devices, we actually chronicled this port 
as it happened. We took this approach because by using portions of the kernel services to debug and/or 
bypass problems we encounter with the device drivers, we make a lot less work for ourselves. When 
needed, we could build a debugging framework around a targeted problem area, focus on it, try 
alternatives, and resolve it to conclusion. 

In other words, at every point along the climb, we have attempted to belay ourselves against the 
foundation of work we have built. (The question now becomes "Was the mountain there to climb, or did 
we build the mountain as we climbed it?" Zen philosophers and systems programmers can debate this 
question at their leisure.) The further we delve into the system, the greater the possibility of a catastrophic 
misstep, so our anchors (tools) must be carefully placed to prevent us from minor falls. Our previous work 
will now form the basis for our current work on drivers. 

And while there are many heartbreaks (and other breaks) which result from falling, nothing is quite as 
sweet as conclusively putting the finger on an obstinate nine-month-old "bug" that has played 
hide-and-seek through your most relentless attempts. ("That which does not destroy us, makes us strong." 
—Nietzsche.) 





UNIX as the Device Driver Interface 


Over the course of integrating drivers into an operating system, a programmer unversed in systems can be 
intimidated by the device interface problem. The common approach is to try and "glue" an arbitrarily 
designed driver onto the side of the kernel and attempt to minimize the interface to the kernel, perhaps by 
doing everything in the driver. This half-hearted approach may result in a (somewhat) working product, 
but it does not lead to efficient and correct design and operation in all cases. Given the frequency with which 
this is done, it’s no wonder that drivers are frequently considered a black art. They are never truly finished 
or fully debugged. ("If carpenters built homes the way programmers write programs, then the first 
woodpecker to come along would destroy civilization.") 

Another approach is to actually reverse your perspective and consider the entire problem as a "bag of 
drivers" with UNIX as the pervasive interface to them (see "Brief Notes: UNIX — Just a 'Bag of 
Drivers?'"). In other words, hold UNIX as the given constant and mold the driver design to suit. This is 
somewhat unorthodox, but can be quite instructive. 

So, instead of dealing with the driver as an independent entity, we take a broader view of the kernel’s 
interfaces and services provided for the drivers’ use. We can then leverage this knowledge of the kernel to 
illustrate the methodology of how the kernel’s rich set of services can be lithely used to integrate device 
drivers. This approach actually fits in quite nicely with the heritage of UNIX device-driver integration. 

Then What is a Device Driver? 

Now that we have shifted our perspective of UNIX, we should really sit down and define our terms 
carefully. In general, the term "device driver" refers to the software that operates a device. Obvious 
enough, right? 

It’s when we try to get specific that we run into trouble. For example, if we extend the definition of device 
to imply control of a "hardware device," we find that we have now excluded many drivers that function 
entirely in software. These "software devices" are used to mimic a device-driver interface to simulate the 
effect of the desired "pseudo-device," such as /dev/pty. (Pseudo-ttys, which simulate terminal drivers, are 
used when logging in over the network with a telnet or rlogin session.) Other device drivers can redirect 
references to yet another driver elsewhere in the kernel, bypassing the "normal" reference. The /dev/tty 
device, for example, always refers to the terminal the process is currently associated with, even though this 
may be different for most processes on the same system. 

In systems other than UNIX, device drivers can vary in role, responsibility, and form. Under MS-DOS, we 
can have drivers implemented in the BIOS, as loadable files (for example, ANSI.SYS), or as TSRs (most 
mouse drivers). Under Mach 3.0, device drivers run outside the kernel in separate processes, as entities 
completely separate from the operating system. 

For our purposes, a driver is a set of functions, compiled into the kernel when it is generated, that connect 
to the driver interface mechanism. Generally, the functions of a driver are all kept in a single source file, 
and there is one driver per device. Frequently, the part of the device that the computer directly interacts 
with is called the "controller," and it may have more than one physical device. If the devices can operate 
autonomously during operations to a degree, they are called "slaves," because they share responsibility 
with the controller "master" for the transfer, unlike "dumb" devices that have a trivial role. 
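
To make "a set of functions that connect to the driver interface mechanism" concrete, here is a minimal sketch of a table of driver entry points indexed by device number. It is an illustration only, not the 386BSD interface: the structure driver_ops, the table driver_table, the dispatcher dev_read, and the software-only "zero" pseudo-device are all hypothetical names chosen for this example.

```c
#include <stddef.h>
#include <string.h>

/* One driver = one set of entry points the kernel can call. */
struct driver_ops {
    const char *name;
    int (*probe)(int unit);                     /* is this unit present? */
    int (*open)(int unit);
    int (*read)(int unit, char *buf, int len);
    int (*close)(int unit);
};

/* A pseudo-device: no hardware behind it, only software. */
static int zero_probe(int unit) { return unit == 0; }
static int zero_open(int unit)  { (void)unit; return 0; }
static int zero_close(int unit) { (void)unit; return 0; }

static int zero_read(int unit, char *buf, int len)
{
    (void)unit;
    memset(buf, 0, (size_t)len);                /* always supplies zero bytes */
    return len;
}

/* Drivers are compiled into this table when the kernel is generated. */
static const struct driver_ops driver_table[] = {
    { "zero", zero_probe, zero_open, zero_read, zero_close },
};

#define NDRIVERS ((int)(sizeof driver_table / sizeof driver_table[0]))

/* The kernel side of the interface: pick the driver, then call through. */
static int dev_read(int major, int unit, char *buf, int len)
{
    if (major < 0 || major >= NDRIVERS || len < 0)
        return -1;                              /* no such driver */
    if (!driver_table[major].probe(unit))
        return -1;                              /* no such unit */
    return driver_table[major].read(unit, buf, len);
}
```

Because the kernel reaches every driver through the same table, a hardware driver and a pseudo-device look identical from above, which is exactly why a definition tied to "hardware devices" proves too narrow.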





What are Drivers Made of? 


Device drivers are usually responsible for all aspects of device recognition, initialization, operation, and 
error recovery. Because the devices may be mounted on a hierarchy of buses and rely on interrupt 
mechanisms of the processor, they interact with many machine-dependent and bus-dependent support 
functions. Many times, the characteristics of the support functions are so different between different 
computers (such as the Mac and a PC or workstation) that drivers for similar controllers look radically 
different. 

The required intimacy with the system and the architecture is one reason that driver code is reinvented all 
the time (the "have it your way" method gone mad). Even UNIX drivers on the same architecture may 
require significant rework to port them between different flavors of UNIX (such as SVR3, SVR4, MACH, 
and BSD). The choice of drivers in 386BSD (as in other UNIX ports), was significantly affected by our 
ability to leverage other drivers present on the same architecture. 

Leveraging Other Drivers 

Sometimes, when there is a good match between the needs of a porting project and those of a reliable and 
well-written "old" driver, it can be leveraged with a minimum of effort. We can then put all our efforts 
into refining something of demonstrated value rather than reinventing the wheel. 

Frequently, however, there is little in common between the two, and trying to glue the old code into the 
new system becomes more trouble than writing one from scratch. Worse yet, an "old" driver may purport 
to be more than it is, by claiming to support functionality that has not been tested, although on the surface 
it may seem to at least pay lip service to needful areas. In fact, we have seen many such half-hearted 
drivers, and very few that methodically set out to extensively support the equipment. The reason is 
obvious: The drivers are finished, as far as the programmer is concerned, and never looked at again. 

You can assess drivers by looking for the hallmarks: structure, form, history, organization, content, 
correctness, and clarity. The hardest hallmark to judge, pragmatics of design and appropriate 
implementation, generally must be borne out through trial of the software. Being a judge of software is as 
difficult as being a judge of character. 

For 386BSD, we assessed two strategies for leveraging past work. The first was to translate driver requests 
into a series of BIOS commands, then support a mechanism to temporarily enter real mode to allow the 
BIOS ROMs to satisfy those requests. The value of this approach would be to obtain 100 percent 
compatibility with any PC-based system (MS-DOS has enforced this from day one). Had this been strictly 
a commercial effort, this strategy might have been satisfactory. For hard disks and display adapters, the 
BIOS mechanism has been quite successful in mitigating hardware configuration problems for users. 




However, items important to a researcher using 386BSD, such as tape backup, networking, and serial 
communications, were not anticipated in the initial BIOS plan, because at the time, these things were 
believed to be far in the future. Also, IBM really only got serious about support for protected-mode BIOS 
with the PS/2 ABIOS, so even trying to leverage some of the BIOS requires the ticklish matter of 
switching from protected mode, and maintaining a context for the non-multitasking BIOS to run in while 
multitasking is going on around it. Clearly, this would require a colossal kludge, as the BIOS was never 
intended for anything but the vagaries of MS-DOS. 


So, although there was a ton of MS-DOS driver software available, we ultimately found little usable code 
without going well out of the scope of the project and markedly altering our specification goals (see DDJ, 
January 1991). To top it all off, our performance would be shot to hell, because code written for a 16-bit 
machine with 64-Kbyte segments doesn’t leverage a 32-bit machine with a 4-gigabyte flat address space 
very well. Having already learned more than we ever wanted about ISA and 386, we had no incentive to 
add BIOS and DOS trivia as well. Thus, we bid a fond farewell to this strategy, fearing that the machine 
might become obsolete before it was fully mastered! 

For the second strategy, we looked at drivers contributed to Berkeley which ran under UNIX on VAX, 
HP300, NS32000, and 386 PC machines. From this source, we were able to satisfy more than half of our 
initial driver requirements, and base our system on software that had some history of operating on another 
platform for a period of time. We could also pick and choose among a number of drivers for some devices. 
Ironically, the better drivers came from the less well-known machines. 

Categories of Device Drivers 

The BSD operating system’s kernel broadly interacts with its device drivers, depending on the kind of 
device and the nature of information it provides. Unit record devices, such as keyboards, terminals, 
modems, and printers tend to fall into one category of device drivers. Mass storage devices such as tape 
drives and hard and floppy disks fall into another category. A third category includes packet transfer 
devices such as network interfaces (Ethernet and token ring controller boards, for example). Bitmap 
display frame buffers could be considered yet another category. 

Often, we would like our system to vary the ways we might configure or interact with these devices, 
depending on need. For example, the point of disk drives is to store and organize both small and large 
collections of data or programs, so it is inconvenient to interact with the disk drive on its terms alone (disk 
sector address and sector data contents). Therefore, we impose an abstraction which allows us to name (or 
key) collections of data as a file. This file system abstraction is the principal way programs make use of 
the disk. We still need to have mechanisms to access the disk as a whole, however, if for no other reason 
than to manage and maintain the file system (for instance, check consistency, backups, file recovery). 

We could use a file system to organize a tape drive as well, and it might work, provided we don’t mind 
waiting minutes for a file. However, tapes are more commonly used as archives, and thus we impose 
data record formats on top of the tape, sometimes variable-sized and with special hardware-generated records 
to denote file separators (or file marks) and end-of-tape indications. 

Unit record devices have little in the way of data structure. The application program pushes data bytes 
through them for the desired effect. For the convenience of the applications programs, the system provides 
for a variety of mechanisms to facilitate optional input and output processing. Among these mechanisms is 
a kind of "super" or metadriver, called a "line discipline." The line discipline acts as an intermediary 
between the device driver and the operating system. The most common of the line disciplines is the "tty 
driver," which implements the semantics of the UNIX keyboard interface (that is, backspace, line kill, 
interrupt/suspend a process) for the user. 
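
As a rough illustration of the kind of input processing a line discipline performs, the sketch below applies erase and line-kill editing to a run of raw characters. The function name tty_canon and the choice of backspace and Ctrl-U as the erase and kill characters are assumptions made for this example; it is not the 386BSD tty driver code.

```c
#include <stddef.h>

#define ERASE_CH '\b'	/* assumed erase character (backspace) */
#define KILL_CH  0x15	/* assumed line-kill character (Ctrl-U) */

/*
 * Hypothetical sketch: copy raw input into a line buffer, applying
 * erase and kill editing, until a newline or the end of the input.
 * Returns the number of characters left in the edited line.
 */
size_t
tty_canon(const char *raw, size_t nraw, char *line, size_t maxline)
{
	size_t i, len = 0;

	for (i = 0; i < nraw; i++) {
		char c = raw[i];

		if (c == '\n')
			break;			/* line is complete */
		else if (c == ERASE_CH) {
			if (len > 0)
				len--;		/* drop the last character */
		} else if (c == KILL_CH)
			len = 0;		/* discard the whole line */
		else if (len < maxline)
			line[len++] = c;	/* ordinary character */
	}
	return len;
}
```

The device driver underneath only moves bytes; all of the editing policy lives in this intermediate layer, which is why one line discipline can serve many different serial and console drivers.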





Network devices are quite different in nature. Incoming and outgoing packets are structured in elaborate 
and (usually) hierarchical ways. Not only is their content important, but so is the time and means by which 
they arrive. Also, unlike the other categories cited, a single data record may end up going to one of many 
different destinations, and this may be dynamically altered as the system software changes routing 
policies. Thus, the kernel’s device interfaces may look quite different from the other categories. 

Bitmap graphics presents yet another I/O interface need. In this case, we must regulate 
access to the frame buffer’s physical memory by arranging to map the memory into an application’s (such 
as an X server) virtual address space. 

Each of these categories interfaces to a different portion of the BSD kernel. Disk drives are interfaced into 
the file system of the kernel and into "device special files" (found in /dev), which allow utility programs to 
bypass the file system. These files are, in effect, trap doors out of the single UNIX file namespace and into 
a given device driver. Device-special files also allow devices in general to be operated by applications and 
utility programs. Network devices are connected to the network protocol processing mechanism and are 
only visible through the network software interface mechanisms. Thus, network devices don’t show up 
among the device-special files. 
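
The "trap door" from a device-special file into a driver can be pictured as a table of driver entry points indexed by the device's major number. The following is a minimal sketch of that idea only; the structure devsw_entry, the function dev_read, and the two toy drivers are hypothetical names chosen for the example and are far simpler than the real BSD device switch tables.

```c
#include <stddef.h>

/* Hypothetical, much simplified device switch entry. */
struct devsw_entry {
	const char *name;
	int (*d_read)(int unit, char *buf, size_t len);
};

/* Toy driver: reads always return end of file. */
static int
null_read(int unit, char *buf, size_t len)
{
	(void)unit; (void)buf; (void)len;
	return 0;
}

/* Toy driver: reads fill the buffer with zero bytes. */
static int
zero_read(int unit, char *buf, size_t len)
{
	size_t i;

	(void)unit;
	for (i = 0; i < len; i++)
		buf[i] = 0;
	return (int)len;
}

/* The major number selects the driver; the minor number selects the unit. */
static struct devsw_entry devsw[] = {
	{ "null", null_read },	/* major 0 (assumed numbering) */
	{ "zero", zero_read },	/* major 1 (assumed numbering) */
};
#define NDEVSW (sizeof devsw / sizeof devsw[0])

/* Dispatch a read on a device-special file to its driver. */
int
dev_read(unsigned major, int minor, char *buf, size_t len)
{
	if (major >= NDEVSW)
		return -1;		/* no such driver configured */
	return devsw[major].d_read(minor, buf, len);
}
```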

BSD Autoconfiguration Goals 

Versions of UNIX prior to 3BSD had a rather fixed notion of configuration; systems were conditionally 
compiled for a given set of hardware or by manually altering the configuration flags in the driver. (Usually 
this was done to save on the amount of system code taking up space — this was important if one had as 
little as 256 Kbytes, where every Kbyte counted.) If the driver was not there, but the hardware was, it 
could not be used. Worse yet, different systems had to be created for differently configured systems, even 
if they had minor differences in interrupt vectors, were missing a redundant card, or had conflicting 
controller port assignments. 

Early 4BSD versions introduced a more versatile form of configuration that allowed for runtime 
configuration shortly after the system’s kernel was loaded, but prior to operation of the kernel. The intent 
of this configuration mechanism was to put off wiring-down device-dependent information until the last 
moment, then attempt to discover as much of this information from the hardware itself and apply it to the 
drivers as needed. The prime motivation was to factor out as much of the idiosyncratic configuration 
differences as possible. 

The goal of this work was to minimize the impact of maintaining a diverse number of computer systems 
and peripherals within a single version of the kernel. The more we can achieve with this the better, 
because the sheer volume of different kinds of devices that can be configured with systems now is 
enormous. 

Even more elegant mechanisms to automatically configure the drivers for the given devices present have 
been developed over the course of time. "Autoconfiguration" was an early innovation in Berkeley UNIX, 
and it remains a hallmark of a Berkeley-derived version of UNIX to the present. 





BSD Autoconfiguration Approach 


In our BSD kernel, we implement autoconfiguration by incrementally searching for all devices that might 
be supported by the drivers present in our kernel. This is accomplished by "walking" a table of device 
information to locate devices on our target system and calling a routine in each associated driver, using 
this information, to check for the presence of a given device. If this probe() routine finds a device, the 
driver can be wired into the system by applying the configuration information saved in the table. We can 
inform the driver of this, so that it can adjust its own parameters and "fine-tune" configuration by calling 
an attach() routine in the driver. (In some cases, the attach() routine may find a terminal conflict with the 
attempted device configuration, and may deny the configuration attempt.) 

Sometimes we have a master device that manages a number of slave devices (a disk controller with 
multiple drives, for instance). In such a case, when we find a controller with the probe(), we iterate 
through each possible subdevice that might exist on the controller by means of a slave() routine in the 
driver. If any slave devices are found, the attach() routine is called for each slave so the drive may be 
"wired" into the driver. 
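
The table walk just described can be sketched in a few lines of C. This is an illustration under stated assumptions, not the 386BSD source: the structures driver and devtab, the function configure, and the simulated "simdisk" controller are all hypothetical, and the probe(), slave(), and attach() signatures are simplified to take only a port and a unit number.

```c
#include <stddef.h>

/* Hypothetical driver entry points, named after the routines in the text. */
struct driver {
	const char *name;
	int  (*probe)(int port);		/* nonzero if controller present */
	int  (*slave)(int port, int unit);	/* nonzero if subdevice present, or NULL */
	void (*attach)(int port, int unit);
	int  nslaves;				/* subdevices to try per controller */
};

/* One entry of the (hypothetical) device information table. */
struct devtab {
	struct driver *drv;
	int port;
	int alive;				/* set when probe() succeeds */
};

/* Walk the table; return the number of devices attached. */
int
configure(struct devtab *tab, size_t n)
{
	size_t i;
	int unit, found = 0;

	for (i = 0; i < n; i++) {
		struct driver *d = tab[i].drv;

		if (!d->probe(tab[i].port))
			continue;		/* nothing at this address */
		tab[i].alive = 1;
		if (d->slave == NULL) {		/* no subdevices to look for */
			d->attach(tab[i].port, 0);
			found++;
			continue;
		}
		for (unit = 0; unit < d->nslaves; unit++)
			if (d->slave(tab[i].port, unit)) {
				d->attach(tab[i].port, unit);
				found++;
			}
	}
	return found;
}

/* Simulated controller for demonstration: present at one port, one drive. */
static int attached;
static int  sim_probe(int port) { return port == 0x1f0; }
static int  sim_slave(int port, int unit) { (void)port; return unit == 0; }
static void sim_attach(int port, int unit) { (void)port; (void)unit; attached++; }
static struct driver simdisk = { "simdisk", sim_probe, sim_slave, sim_attach, 2 };
```

With a table holding two entries for simdisk at different ports, only the entry whose probe succeeds is marked alive, and only the one responding drive is attached.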

Depending on the computer, it’s possible to do autoconfiguration with varying success. Sometimes, much 
of the device-dependent information can be obtained by the software cleverly manipulating the device to 
reveal how it is attached to the system. At other times, it is nearly impossible to detect the presence of a 
device. Worse yet, a hidden conflict between two mutually exclusive devices could cause them to interfere 
with each other. (This happens all too easily on the ISA bus.) 

As a result of the configuration pass, a manifest of devices and related configuration information is tallied 
on the console device, so that an operator can observe what the kernel was able to find and make use of. 
This can be of great use in diagnosing dead equipment, especially if either a device known to be present in 
the computer fails to respond, or if a device known to be missing mysteriously shows up in its place. 

Alternative Autoconfiguration Approaches 

BSD’s current autoconfiguration scheme is rigidly top-down, not unlike that of a recursive descent parser. 
To begin with, all buses directly connected to the computer are probed successively. While examining 
each bus, all devices on a given bus are summarily probed, and in turn, all slave devices on a given 
controller device. But this approach has some drawbacks; at the time the probe() for a device succeeds, 
we may not yet have all the device information needed to attach() it. 

An alternate solution suggested by Chris Torek (LBL) is to change this arrangement and instead do 
successive "depth-first" probe()s on all lower-level objects to discover all information about the device 
and its hidden requirements before committing to the corresponding attach(). Thus, a more complete 
picture of a device’s demands and conflicts can be obtained before we commit to attaching the device. 

Yet another possibility might be using a two-pass, or "bottom-up" method, in which all devices, resources, 
and dependencies are found on the first pass in a kind of "survey" expedition. Having gathered a complete 
picture of system requirements, the second pass assembles the pieces as if they were Lego blocks, 
incrementally attaching them from peripheral to controller to bus to driver. A device can be said to exist 
by its driver if a complete, connected path is available. 
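
To make the two-pass idea concrete, here is a minimal sketch in which a first "survey" pass is assumed to have already marked each node as found or not, and the second pass attaches only those nodes whose entire path up to the root bus was found. The node structure and the function names are hypothetical; a real implementation would also carry resource and dependency information for each node.

```c
#include <stddef.h>

/* Hypothetical node in the device tree: peripheral, controller, or bus. */
struct node {
	const char *name;
	struct node *parent;	/* NULL for the root bus */
	int found;		/* set by the survey (first) pass */
	int attached;		/* set by the assembly (second) pass */
};

/* Return 1 when every node from n up to the root was found. */
static int
path_complete(const struct node *n)
{
	for (; n != NULL; n = n->parent)
		if (!n->found)
			return 0;
	return 1;
}

/* Second pass: attach every node with a complete, connected path. */
int
assemble(struct node *nodes, size_t n)
{
	size_t i;
	int count = 0;

	for (i = 0; i < n; i++)
		if (path_complete(&nodes[i])) {
			nodes[i].attached = 1;
			count++;
		}
	return count;
}
```

A drive that answered the survey but sits behind a controller that did not is left unattached, because its path to the bus is broken.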





Note that with a complete description of dependencies by either of these mechanisms, we don’t need to tie 
down the processor’s interrupts, special equipment requirements, or other resources — except when we 
actually open the device — so we don’t have to configure solely at boot time. Thus, we could change 
drives with the driver file closed, and when it reopens, the system will discover the change and adapt 
accordingly. 

When Autoconfiguration Comes into Play 

The current BSD kernel manages to locate and configure devices upon boot-up because it must find (at 
least) the characteristics of the root file system, paging store, and console device, so that it can begin the 
most basic operation. Because it has to do all that, the reasoning goes, it might as well find everything 
else. This is adequate for most purposes, but should you wish to reconfigure a SCSI tape drive, for 
example, it’s a bit of a pain to reboot the system. (Actually, configuration should be done on device 
"open" as well as during boot-up, but this is a lot of work to do correctly and hence is usually not done.) 

Information Required for Autoconfiguration 

Autoconfiguration does not stop with just finding the device. More than half the battle is accumulating all 
the information possible about the device, in order to properly attach it. As an example, let’s try to capture 
a general list of possible information desired. This should extend beyond the needs of the ISA bus, 
because we may need to consider other buses. 

How and Where to Find Devices 

Devices are usually found on a bus of some kind. In fact, it is not unusual for a computer to have more 
than one bus, or even buses of more than one type. The EISA bus, for example, is a kind of bus within a bus, 
with ISA devices working by one set of rules and full EISA cards working by a completely different set 
(for example, slot-independent vs. slot-relative). Thankfully, less common is a hierarchical bus 
arrangement, where bus adapters themselves are devices on buses. (There are DEC VAX machines that 
use this to a depth of two or three.) In these cases we need a description of how to find the I/O port 
or memory-mapped control and status registers of the given device. We may also need to locate the shared 
buffer memory that display adapters and network interfaces may require. Some bus facilities imply sharing 
or arbitration among devices; thus, special care must be taken to avoid conflicts between devices. 

Device Signalling 

The processor interrupt mechanisms, which usually differ with each style of bus, must be determined. 
Many new devices that support shared use among multiprocessors, or that have multiple data streams 
(such as disk arrays), possess hardware "mailbox" mechanisms to report their progress as they complete 
lists of operations that the driver may have in progress. As we demand higher aggregate data rates, the 
complexity of our hardware I/O system may require more elaborate mechanisms to synchronize the 
hardware with software, and these will necessarily need to be configured and managed by our operating 
system. 





Bulk I/O Facility Usage 


For mass data transport, we may need to find and allocate DMA facilities, which may be in the form of 
channels or dedicated buses. Some of these may require conflict mitigation and perhaps (in the future) 
bandwidth reservation. Some facilities also require address translation, as we take a large, logically 
contiguous transfer and scatter/gather it to a group of data pages (seemingly) randomly disposed around the 
system. 

Device Characteristics 

We may have a device with no peripherals, dumb peripherals such as printers or terminals, or those with a 
master/slave sharing of responsibility. These devices have configuration-dependent parameters that may 
be set with hard DIP switches or soft configuration mechanisms. (Some manufacturers have caught on to 
the soft configuration approach. Newer Ethernet cards for the ISA, for example, utilize clever mechanisms 
to do this.) Disk drive capacity and geometry must also be determined. Modern peripheral standards such 
as ESDI and SCSI use standard methods to obtain this information. Some devices may have conflicts with 
others (for example, dual ported access of a single drive), and these must be uncovered. The revision of a 
given device and its diagnostics/disaster recovery mechanisms is also important information (for instance, 
does the disk drive use bad sector sparing?). 

Autoconfiguration and Disk Drive Labels 

Within our BSD system, we usually subdivide disk drives into partitions that may contain different kinds 
of file system abstractions — all on the same drive. To describe this and the disk geometry in a 
device-independent fashion, we use a "disklabel" embedded in the data on the drive. The actual location of 
the disklabel may not be standard across all storage architectures, but the contents and use of the 
information in the layers above the disk driver itself are identical in all cases. 

The data structure definition of the current BSD disklabel attempts to support a rather diverse group of 
mass storage architectures. As a final part of the autoconfiguration process, the disk driver extracts this 
data structure from the disk drive and adjusts its parameters, including drive partitioning tables, to reflect 
this information. The kernel uses this information to determine which portions of the disk have been set 
aside for paging, which have various file system types, and the underlying physical storage parameters 
implied (such as file system block and fragment size). 
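
As an illustration of how a driver might use the partition table taken from the label, the sketch below maps a partition-relative sector number to an absolute sector on the drive, refusing requests that fall outside the partition. The structures shown are a deliberately reduced stand-in for the BSD disklabel; the names label, part, and label_abs_sector, the limit of eight partitions, and the file system type codes are assumptions for the example.

```c
#include <stdint.h>

#define MAXPART 8		/* assumed partition limit for this sketch */

/* Reduced stand-in for one partition entry of a disklabel. */
struct part {
	uint32_t p_offset;	/* starting sector on the drive */
	uint32_t p_size;	/* length in sectors */
	int	 p_fstype;	/* file system type code (values assumed) */
};

/* Reduced stand-in for the label itself. */
struct label {
	uint32_t d_secsize;	/* bytes per sector */
	uint32_t d_npartitions;	/* entries in use below */
	struct part d_partitions[MAXPART];
};

/*
 * Translate a sector number relative to a partition into an absolute
 * sector number on the drive. Returns -1 if the partition does not
 * exist or the sector lies beyond the end of the partition.
 */
int64_t
label_abs_sector(const struct label *lp, unsigned part, uint32_t blk)
{
	if (part >= lp->d_npartitions || part >= MAXPART)
		return -1;
	if (blk >= lp->d_partitions[part].p_size)
		return -1;
	return (int64_t)lp->d_partitions[part].p_offset + blk;
}
```

Because the bounds check uses only the label contents, the same code serves any drive geometry once the label has been read.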

Higher-level Autoconfiguration 

Up to this point, we have only outlined the information that the kernel of the operating system may require 
to configure itself appropriately. Many systems do this low-level configuration well, but few go beyond 
this after the system boots up and configures itself for use. Other configuration procedures, such as finding 
and mounting various file systems, attaching to various computer networks, and generally embedding 
itself into the fabric of the local and regional computer environment, are not usually done. 

However, in this modern era of LANs, enterprise networks, and global internetworks, computer systems 
no longer stand alone. High-level configuration of resources has now become a necessity. As a result, one 
of the current trends in modern computer systems is resource discovery and management. The cost of 
systems management is usually calculated on a per-computer basis, and as personal computers and 
workstations replace dumb terminals, this grows to be a significant factor. 

In addition, as the demand for better applications programs increases, more configuration information 
needs to be maintained per system. At the same time, manufacturers are being forced to grant more 
autonomy to computer usage groups and move away from the centralized MIS-management mentality that 
made the trains run on time. Managing what one consortium describes as the "Distributed Computing 
Environment" is going to be quite a challenge over the next few years. 

The ISA in a Nutshell 

Now that we have examined how BSD handles configuration, and understand the interface, we must study 
the other side of the question — how to work a device on a bus. In the 386BSD porting project, the ISA 
bus was chosen for the initial port, as it is the most common bus available. 

Before we can delve into the code, a review of the ISA bus is necessary. A driver’s view of the ISA bus 
reveals the mechanisms we must create to work a device on the bus. 

I/O Ports 

The ISA has an independent I/O bus, separate from its program and data memory bus, that is primarily 
used to twiddle the bits for the control and status characteristics of devices. It consists of 1024 discrete, 
byte-sized "ports," some of which can be accessed in twos as 16-bit-wide operations. Each port may be 
read or written, and a given device usually decodes (or implements) a block of them (8, 16, or 32). Some 
devices function exclusively through the I/O ports — even the most common hard disk controller (which 
relies on "string" instructions that repetitively sequence data through a single port). 

The ISA bus, having mere rudiments of configurability, relies in part on devices being at known port 
locations, and has no mechanism to discriminate conflicting devices that may have overlapping or 
mutually exclusive assignments (in other words, it does not work). For those devices which do not have 
standard port addresses, freely assignable zones serve as catch basins in which to place them. Most cards 
have only a handful of alternative port assignments (each a different handful, of course), so avoiding 
conflicts with a fully stocked box can sometimes be a tedious puzzle. (This is often made more interesting 
when a hardware manufacturer cleverly decodes more ports than are documented.) This leads to the 
"scraped knuckles" effect, where the computer’s chassis is laid open, and cards shuffled in numerous 
attempts to find the "holy grail" — the correct combination of DIP switches, hardware options, slots, and 
cables. (All this, while muttering on the 45th attempted power-on, the immortal phrase from Bullwinkle, 
the patron saint of programmers, "This time for SURE!") 
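
Since the bus itself cannot detect overlapping port assignments, one modest defense is to check the configuration table in software before touching any hardware. The sketch below counts the pairs of entries whose port blocks overlap. The structure portrange and the function port_conflicts are hypothetical names for this example, and the check can only catch conflicts among ports that are actually listed in the table (undocumented decoding remains invisible to it).

```c
#include <stddef.h>

/* Hypothetical table entry: a block of consecutive I/O ports. */
struct portrange {
	const char *name;
	unsigned base;		/* first port decoded */
	unsigned count;		/* number of ports decoded */
};

/* Two half-open ranges [base, base+count) overlap if each starts before the other ends. */
static int
overlaps(const struct portrange *a, const struct portrange *b)
{
	return a->base < b->base + b->count &&
	    b->base < a->base + a->count;
}

/* Return the number of conflicting pairs in the table. */
int
port_conflicts(const struct portrange *t, size_t n)
{
	size_t i, j;
	int conflicts = 0;

	for (i = 0; i < n; i++)
		for (j = i + 1; j < n; j++)
			if (overlaps(&t[i], &t[j]))
				conflicts++;
	return conflicts;
}
```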

Suffering ISA definitely makes one appreciate EISA or MCA all the more, although ISA systems and I/O 
cards are still being produced in massive numbers. Hard to believe that so much work is still being done 
with a bus that was inspired by the Apple II, technological aeons ago! 

Interrupts 

Devices commonly have one or no interrupts: they rarely have more than one. Again, like I/O ports, there 
are "standard" assignments for common cards, but the situation is a little more desperate here because we 
have far fewer interrupts than ports. Depending on whether we have an XT or AT card, we can have as 
many as 6 or 11 unique choices of interrupts, respectively, out of a net 15 interrupts that the ISA PC fields. 





This selection is usually constrained even more because few cards allow more than a selection of two or 
three different interrupts. Also, each interrupt has a discrete priority above higher numbered ones, so 
choosing a different interrupt can alter the processing order of the interrupt (the lowest numbered ones 
always getting first billing). 

The software has no independent way of ascertaining the association of devices with interrupts, unless it 
compels a device to interrupt when all other devices are forced mute. (This assumes that the device can be 
programmed to interrupt without external stimulus.) For electronics reasons, cards cannot reliably share an 
interrupt. Also, interrupts whose source is too brief to be recorded get unceremoniously deposited onto one 
of two interrupts, each of which may have a device connected as well. (These interrupts do 
"double-duty.") 

I/O Display/Buffer/ROM Memory 

Some devices use a portion of the dedicated region of memory resident on the ISA bus. This region is 
frequently called the "hole," as it slices the machine’s RAM into base and extended memory. Unlike the 
I/O ports mentioned earlier, this memory is not usually used for device control registers, but for various 
other purposes. Display adapters use dedicated regions of this memory to hold their frame buffer (or, if in 
higher resolution mode, a "window" or segment of the frame buffer too big to fit in the "hole"). Network 
controllers often have shared-memory buffers that can be selected to steal a portion of this memory as 
well. Finally, the BIOS ROMs, also present in the hole, scan it to find other device ROMs to supplement 
its functions with. This is how display adapters retain software compatibility — by extending the number 
of display modes available through the BIOS and hiding the actual register programming from view. 
Network and hard disk controller cards use this method to allow for initial loading of MS-DOS off the 
network or SCSI hard disk. As a characteristic of the ISA, this region of memory is apportioned by ad hoc 
rules and is the frequent bane of configuration. 

Direct Memory Access (DMA) 

Various devices implement the direct memory access mechanism of the ISA. Three 16-bit and four 8-bit 
wide DMA slave transfers to a single master are available for dedicated use of cards specifically designed 
to make use of them. An interesting feature of the original PC/AT was that a string instruction to move 
data for the disk controller was faster than the DMA channel, so the disk controller did not even bother to 
implement the DMA channel. Unfortunately, the standards for the ISA have been set by its progenitor, so 
the bandwidth hallmarks of DMA transfer are not present with this bus. Not surprisingly, because of the 
various restrictions, cards using the DMA facilities are not as common as with other computers. As with 
the interrupt facilities, the software has no direct method to determine which card is connected to which 
DMA channel. An even more critical failing for a 386/486 system that uses paging is the lack of a page 
map to do "scatter/gather" to the 4 Kbyte-sized pages that might be located at random physical addresses, 
yet consecutive virtual addresses. The DMA facility only works on consecutive physical memory, so the 
software must improvise a solution. 
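
One possible improvisation, sketched here under stated assumptions, is to break a virtually contiguous transfer into runs that are physically contiguous, so that each run can be handed to the DMA hardware separately (or copied through a contiguous buffer instead). The array pagemap stands in for the real page tables, and the names seg and dma_segments are hypothetical; this is not presented as the method 386BSD uses.

```c
#include <stddef.h>
#include <stdint.h>

#define PGSIZE 4096u		/* 386 page size */

/* One physically contiguous run of a transfer. */
struct seg {
	uint32_t phys;		/* physical start address */
	uint32_t len;		/* length in bytes */
};

/*
 * Split [vaddr, vaddr+len) into physically contiguous runs.
 * pagemap[i] holds the physical page frame number of virtual page i
 * (a stand-in for the page tables). Adjacent pages that happen to be
 * physically consecutive are merged into one run. Returns the number
 * of runs written to out, or 0 if more than maxseg would be needed.
 */
size_t
dma_segments(const uint32_t *pagemap, uint32_t vaddr, uint32_t len,
    struct seg *out, size_t maxseg)
{
	size_t n = 0;

	while (len > 0) {
		uint32_t off = vaddr % PGSIZE;
		uint32_t chunk = PGSIZE - off;	/* bytes left in this page */
		uint32_t phys = pagemap[vaddr / PGSIZE] * PGSIZE + off;

		if (chunk > len)
			chunk = len;
		if (n > 0 && out[n - 1].phys + out[n - 1].len == phys)
			out[n - 1].len += chunk;	/* extends previous run */
		else {
			if (n == maxseg)
				return 0;		/* too fragmented */
			out[n].phys = phys;
			out[n].len = chunk;
			n++;
		}
		vaddr += chunk;
		len -= chunk;
	}
	return n;
}
```

The fewer runs that result, the fewer separate DMA operations the driver must set up, which is one reason a system might prefer to allocate I/O buffers from physically consecutive pages where it can.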





386BSD Autoconfiguration Scheme 


Having reviewed the key points of our ISA bus, the question becomes "How do we do autoconfiguration 
for 386BSD?" Luckily, this is not as involved an answer as one might think, because our little 386 ISA 
bus machine is guaranteed to have just a single bus with a maximum of a few handfuls of hardware 
devices that need support. (We only have 8 slots.) 

First, we create a configuration table that allows us to encode the descriptions of where to find the devices 
on the bus, as well as wild card values that require us to go out and compel the device to interrupt to locate 
which interrupt it’s configured for. 

To find interrupts, we program the interrupt controller to allow us to poll the interrupt lines to check for 
activity on a given line when we probe for a device, and we wait for a sufficient period before giving up. 
With some notorious devices, we just wire them into the designated interrupt in the table and go on. For all 
remaining interrupts not found, an interrupt catcher table will reflect them to an error-logging service of 
the kernel, so we can note their occurrence. 
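
The wild card lookup described above might be sketched as follows. The table entry isa_dev, the IRQ_UNKNOWN marker, and the function find_irq are hypothetical names; the two function pointers stand in for "prod the device" and "read the pending interrupt lines from the interrupt controller," and here they are backed by a small simulation rather than by real hardware access.

```c
#define IRQ_UNKNOWN (-1)	/* hypothetical wild card value in the table */

/* Hypothetical configuration table entry. */
struct isa_dev {
	const char *name;
	int port;
	int irq;		/* IRQ_UNKNOWN means "go find out" */
};

/*
 * Resolve a wild card interrupt by compelling the device to interrupt
 * and polling for a newly pending line. Entries with a fixed interrupt
 * are simply trusted. Returns the line found, or IRQ_UNKNOWN after
 * "tries" polls without activity.
 */
int
find_irq(struct isa_dev *dp, void (*prod)(int port),
    unsigned (*read_pending)(void), int tries)
{
	unsigned before, diff;
	int line;

	if (dp->irq != IRQ_UNKNOWN)
		return dp->irq;			/* wired down in the table */
	before = read_pending();		/* lines already active */
	prod(dp->port);				/* ask the device to interrupt */
	while (tries-- > 0) {
		diff = read_pending() & ~before;
		for (line = 0; line < 16; line++)
			if (diff & (1u << line)) {
				dp->irq = line;
				return line;
			}
	}
	return IRQ_UNKNOWN;			/* gave up waiting */
}

/* Simulation: a device at port 0x300 raises line 5 when prodded. */
static unsigned sim_pending;
static void sim_prod(int port) { if (port == 0x300) sim_pending |= 1u << 5; }
static unsigned sim_read(void) { return sim_pending; }
```

Comparing against the lines that were already pending before the prod is what keeps an unrelated, already active device from being mistaken for the one under test.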

Next we use a probe() entry, locate the device, and "prod" it into optionally generating an interrupt and a 
DMA request. Sometimes this can be subtle to write, because we need to determine if anything at all is 
present with the supplied parameters, yet we don’t want to inadvertently trigger a device we haven’t gotten 
to in our list of autoconfiguration table entries. 

Occasionally, the only way to avoid these conflicts is by ordering autoconfiguration, as in the case of 
display adapters. Backwards compatibility with earlier software was required, so VGA and EGA display 
adapters would decode the older CGA/MDA addresses as a part of the auto-sense feature to support 
software that only knew of the older boards. If we probe for the existence of the boards in an 
oldest-to-newest order, newer boards will respond as older ones, thus confusing the situation. By checking 
in order of newest-to-oldest, we can associate the correct driver with the appropriate board, even though 
there may be some ambiguity. 
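
The effect of probe ordering can be shown with a toy model. In the sketch below, the simulated hardware answers a probe for its own type and for every older type, which is a simplification of the backwards-compatible decoding described above; the enum board, the variable installed, and the functions responds_as and identify are all hypothetical.

```c
/* Board types, numbered from oldest to newest (simplified model). */
enum board { NONE, CGA, EGA, VGA };

/* Simulated hardware: the board actually present in the machine. */
static enum board installed = VGA;

/* A board answers probes for its own type and for every older type. */
static int
responds_as(enum board type)
{
	return installed != NONE && type <= installed;
}

/* Return the first board type in "order" that answers a probe. */
enum board
identify(const enum board *order, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (responds_as(order[i]))
			return order[i];
	return NONE;
}
```

With a VGA installed, probing in the order VGA, EGA, CGA identifies the board correctly, while the order CGA, EGA, VGA stops at the first entry and reports a CGA.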

As we find devices, we logically connect interrupts and DMA request signals to the associated drivers. 
With interrupts, we point the Interrupt Descriptor Table (IDT) call-gate entry to the assembly language 
stub routine associated with the driver. We then adjust the interrupt mask to disable interrupts for all 
devices in the group to which the driver belongs. (In the future, we will learn more about such interrupt 
groups.) 

To complete the attach of the device, the attach() routine in each driver is called to configure the device 
appropriately for operation and to report relevant facts about how the device can be used back to the 
system. Network drivers manage to extract link layer addresses embedded in the cards and inform the 
network protocol portions of the kernel of characteristics. 

Either at the time of attach, or at the subsequent open, disk labels are extracted off of disk drives, and the 
system can be made aware of the kinds of file systems used, including paging areas for virtual memory. 





386BSD Autoconfiguration Limitations 


We’ve described the information our BSD kernel might wish to obtain from the hardware to configure 
devices, and what the ISA has to offer in this regard. The two are far from a perfect match. Much 
information is missing that we would prefer to have, and the situation regarding configuration conflict 
detection between devices seems almost hopeless. But this is assuming we have no hints at all about the 
bus; in fact we do, and we are compelled to use them. 

For the more ancient and problematic cases (such as printer parallel ports that won't generate an interrupt 
unless a printer is attached and ready), we can force the configuration table to assume the interrupt 
associated with the device. Thus, if the printer is detected during a probe, the software will dutifully wire 
down the interrupt vector without verifying that it actually is attached. These limits are primarily due to 
the lack of information available because of the history of the ISA bus. 

Other Buses 

Much of our current strategy has focused around the ISA bus of our target machine. However, there are a 
number of machines which utilize other buses, such as the MicroChannel (MCA), EISA, VME, or other 
non-ISA buses. To implement these bus types, this portion of the system would change greatly. While 
additional buses were outside of the scope of our project, we did not desire 386BSD to be limited solely to 
ISA, so the ISA bus-related code is a configurable option with a defined interface into the kernel. To add 
support for other buses, you can add the functionality in alongside the current ISA code and use it as an 
example. 

In the case of EISA (which extends the functionality of the ISA bus for new cards designed to this 
standard), such new code would be interwoven with the existing ISA autoconfiguration mechanism, as 
both would be needed to support old ISA and new EISA devices. For MCA, which uses a completely 
different approach to board configuration and is incompatible with ISA, the autoconfiguration and device 
drivers would be completely separate from the ISA code. 

Next Time 

Now that we have reviewed autoconfiguration and its mechanisms, it is time to move on to actual driver 
operations, such as the enabling operation of the PC hardware devices, splX() (interrupt priority-level 
management), and the interrupt vector code. After this, we will walk through the code of some sample 
drivers, noting the important points in light of our knowledge about BSD autoconfiguration and interfaces. 
We will examine in detail some of the code required for the console, disk, and clock interrupt drivers. The 
basic structure, minimal requirements, and extending the functionality of these drivers through procedures 
such as disklabels will also be discussed. 

Brief Notes: UNIX--Just a "Bag of Drivers?" 

Interfaces are a rather crucial part of an operating system, yet we’ve managed to avoid them up to now. 
How? Well, it wasn’t as hard as you might think. A look at the heritage of operating systems might be 
instructive. 





Many early operating systems were little more than a "bag" of drivers and subroutines to make use of 
them. The operating system provided the "common unifying" interface between hardware resources and 
the applications that consumed them. Initially, these early systems used a handful of physical resources 
(disk blocks, RAM, CPU) packaged as abstractions (files, address space, time slice) which an application 
would obtain and then relinquish to the system, as needed. 

More advanced systems attempted to "multiplex" resources in an effort to manage resources more 
efficiently among a number of competing applications, in order to get the most use out of expensive 
hardware (in other words, amortize the costs over widespread use). As operating systems began to contend 
with networks, data exchange formats and conversion, and standard programming languages, the size and 
extent of a user’s reach extended beyond a single machine. Computers begot more computers, and only 
then did the issues of resource sharing and interface standardization become worthy of notice. 

Because resource sharing/multiplexing was done primarily for cost reasons, and only secondarily for 
convenience and cooperation with other users, it has gotten short shrift from designers and standards 
groups, until now. However, so many conflicting approaches exist that it appears hopeless that there will 
ever be "a standard operating system," let alone "a standard computer architecture." 

The modern bane of technology is that as complexity increases, it starts to overwhelm and blind us with its 
bulk. As this occurs, we are required to deal with the more microscopic elements in ever greater orders of 
magnitude. Constant improvement in our algorithms, mechanisms, and paradigms is the only way we can 
ever hope to mitigate this deluge. 

And we have not even mentioned the new demands on the frontiers of development, in which video and 
audio signals, representing hundreds of megabytes per second of bandwidth, need to be channeled, 
processed, and combined for multimedia purposes. Nor have we mentioned the need for cooperative 
multiprocessing applications. 

Future operating systems challenges will be quite different from those of the past. The economics of 
computers no longer require us to use whatever means necessary to save a handful of bytes here and there. 
We can opt for a direction that leads toward increased clarity and scope, instead of recreating a new 
version of the old. Paradigm shift sometimes allows us to take a step back and recognize that the tree 
leaves we were previously staring at really are part of a forest. 

However, even at this stratospheric level, the operating system still retains its original heritage of being a 
"bag of drivers." Damn elaborate drivers maybe, but drivers nonetheless. In a way, we have been 
discussing various aspects of the driver interface all along, because UNIX is the interface. 


—B.J. and L.J. 

Porting Unix To The 386: Device Drivers 

The structure of 386BSD device drivers, interfaces to the operating system, and minimal device drivers for 
the console, disk drive and scheduling clock of the PC. 





Drivers for the basic kernel 

This article contains the following executables: UXDRIVER.ARC 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, the first virtual memory microprocessor-based UNIX system. Prior to 
establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric 
Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or 
comments to lynne@berkeley.edu. (c) 1991 TeleMuse. 

In our November 1991 installment, we discussed I/O device initialization via automatic configuration. 
Unlike predetermined or static configuration, automatic configuration is a powerful mechanism that 
reduces the complexity of different configurations, adjusts the operating system to best make use of 
resources (such as mass storage), and discovers and configures dynamically changing services (such as 
networks). Device initialization is an important element of any operating system — how else could you use 
even the simplest disk drive or floppy or communicate on a network? By incorporating an active 
mechanism into our design, our drivers become a stronger base upon which to build a system. Automatic 
configuration, or autoconfiguration, is a key item which differentiates Berkeley UNIX from other systems, 
and is the hallmark of any Berkeley-derived system. By thinking ahead from the very beginning (for 
example, during initialization), we can anticipate the scope of a porting project. 

Unfortunately, programmers often become so involved in the minutiae of a particular device type that they 
minimize the driver interface itself, leading to inefficient or faulty design. In the case of initialization, the 
driver is usually written only for the singular device desired, rather than in anticipation of future 
requirements. Later, it is altered to suit the dozen or so variations of device eventually used. This approach 
almost always results in more tinkering than actual coding. However, under the gun, many programmers 
just glue the driver onto the side of the operating system and hope it works. In 386BSD, we considered the 
UNIX kernel as merely a pervasive "interface," and the driver as the extension of this interface — the last 
step to contact with the device itself — instead of viewing the driver and UNIX as separate entities. Thus, 
we incorporated the "interface" into our driver design from the beginning and ignored the minutiae 
involved in driver operations which might confuse the issue. 

In November we also introduced some of the vagaries of ISA devices that may be encountered in our 
device driver, and attempted to delineate possible conflicts that autoconfiguration might need to resolve. 
Here we will describe in more detail the support macros and functions of the 386BSD kernel that drivers 
use to work devices, such as splx() (the interrupt priority-level management) and the "interrupt vector 
code." We will also continue to build upon our "UNIX as interface" paradigm — the kernel’s interface to 
drivers — by studying the mechanisms by which we can extend the kernel’s grasp through the driver. 

Finally, we examine some important "points of light" (sorry George — we don’t have a thousand) in our 
sample drivers, including the console, disk, and clock interrupt drivers, especially with respect to BSD 
autoconfiguration and interfaces. We’ll also discuss the basic structure of these drivers, minimal 
requirements, and extending the functionality through procedures such as disk labels. 





Device Drivers Needed for the Basic Kernel 


As a reminder, our goal was to create a "basic" system that could ultimately compile itself, so we could 
use it to accelerate the progress of the port. (The alternative, cross-support, is considerably more tedious 
and error prone because of the communications burden.) 

The drivers in the basic kernel must provide a root file system on a hard disk drive, some kind of terminal 
function, and a rescheduling clock (because UNIX is a multiprogramming system). Though a fully 
functional system will require more than this, we can build a self-supporting system and refine it later. 

Because we desire 386BSD to be available to a broad audience, our choice of devices is relatively fixed. 
Our standard 386 ISA PC contains an AT-styled disk controller and a garden-variety 40-Mbyte disk, the 
same as hundreds of thousands of PCs everywhere. A standard PC keyboard and display adapter are our 
terminal devices, and our rescheduling clock is implemented from the onboard programmable timer on the 
PC’s motherboard. 

Contents of the Device Driver: A Dumping Ground? 

Device drivers frequently suffer from middle-age spread. They accumulate features from the past, both 
bad and good, and grow to gigantic proportions. After a while, they become accumulations of baggage — 
"dumping grounds" for unchallenged code. 

For example, one megafirm’s research lab was stymied in trying to cudgel a driver for a special disk and 
controller. Unfortunately, the lab's entire budget was shot on servers built with these drivers, and the 
researchers were therefore committed to seeing the project through to completion. Over the years, the 
driver had grown to be an obscure 3000-plus line monument that never worked. Yet the firm clung to it in 
the vague hope that with just one more line, it might all work out. 

Finally, the company hired an outside expert in drivers. In a week, a completely new driver was written, 
tested, and completed, using (tiny) fragments of the old driver that were isolated and incrementally 
proven. The new, clearly written driver specifically implemented only the features needed by the server, 
and was a fraction of the size. Because it used a "minimalist" approach, the critical portions of the driver 
stood out in detail. The "crisis" came to a deterministic conclusion. (The servers went online.) 

The moral here is to distrust anything overbuilt and underjustified. And when in doubt, simplify, simplify, 
simplify. With our early drivers, we must enforce Spartan discipline with respect to "featurism." We want 
simple and direct code that provides basic functionality. This is not to say we will be devoid of any 
features — some flexibility is needed because our equipment is not uniform. For example, though the disk 
controller on ISA PCs is almost identical, the disk drives (for example, capacities and geometries) are 
quite different. We would also like to support common portions of different display adapters, so that if we 
need to run on an MDA in a pinch, we can. Finally, we would like to use display editors and buffer kernel 
error printouts on separate screens. All in all, though not elaborate, our needs are something more than 
stone knives and bearskins. 

It’s not our intention to expand the content of these drivers much. They represent the "default" case of bare 
minimum support. Drivers targeted specifically at a device (say, the VGA) will also be the appropriate 
place for more elaborate functionality (such as bitmap and color palette support), and they will 
autoconfigure ahead of the default display driver. 





Haven’t We Met Somewhere Before? 


Many drivers are essentially "copies" of other drivers, because one controller generally looks pretty much 
like another. However, when the framework of the driver is incompletely or incorrectly replicated, new 
bugs are introduced. 

Terminal drivers are alike. In fact, the early UNIX terminal drivers were so similar that the "pseudo" or 
"super-driver" tty.c was created just to share the common code. Likewise, the 386BSD kernel contains 
support routines for disk and network devices to minimize driver code replication. 

Display and Keyboard Driver 

The Display/Keyboard driver (cons.c, available electronically; see page 3) provides two kinds of output. 
For user processes, it filters character I/O from the keyboard and to the display screen through the tty 
terminal driver. For the kernel, it provides a character output routine used by the kernel to print disaster or 
panic messages. Multiple virtual screen support is implemented to allow separate screens for the kernel, 
user session, and editors. A tiny terminal emulator is present to allow vi, emacs, or jove editors to run on 
the basic kernel. 


Hard Disk Driver 


The hard disk driver (wd.c and wdreg.h, also available electronically; see page 3) supports the AT-style, 
programmed I/O disk controllers (WD100[2347]). The driver reads in a data structure called a "disk label" 
off a known sector (in this case, the second sector of the disk). This allows the driver to configure itself for 
arbitrary drives, because it consults the drive first to obtain information about how to use the drive, before 
any other transfers are attempted. By doing this, one driver covers all possible disk drives (and there are 
many) — in other words, one size fits all! Someone, however, must craft such a disk label and use a 
program to tack it on before the disk can otherwise be used. As you might guess, we get into a "chicken 
and egg" situation here — we need the disk label to be on the drive before we can write it on the drive! 

This is not a problem in practice, because all drives have at least the first 17 sectors in common (for 
example, the first track of the smallest size). So we use a default disk label that corresponds to the smallest 
drive, and overwrite that "logically" when we label the disk to give it its own identity. 
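
A minimal sketch of the fallback logic, assuming a much-reduced label structure. The names mini_label and choose_label() are invented for illustration, and the default geometry of a single 17-sector track is our reading of the paragraph above rather than the values used in wd.c.

```c
#include <stdint.h>

#define LABEL_MAGIC  0x82564557u   /* BSD disk label magic number; illustrative here */

struct mini_label {                /* simplified, hypothetical subset of a disk label */
    uint32_t magic;
    uint16_t secsize;              /* bytes per sector */
    uint16_t nsectors;             /* sectors per track */
    uint16_t ntracks;              /* tracks per cylinder */
    uint16_t ncylinders;
};

/* Geometry every AT-style drive can satisfy: the first 17-sector track. */
static const struct mini_label default_label = {
    LABEL_MAGIC, 512, 17, 1, 1
};

/* Pick the geometry to use: the label read from the disk if it looks valid,
 * otherwise the conservative default. */
const struct mini_label *choose_label(const struct mini_label *from_disk)
{
    if (from_disk != 0 && from_disk->magic == LABEL_MAGIC &&
        from_disk->nsectors != 0 && from_disk->secsize != 0)
        return from_disk;
    return &default_label;
}
```

An unlabeled disk thus remains readable far enough to reach the label sector, which is all the labeling program needs.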


Clock Driver 


The clock driver (Listing One, page 93) is the simplest driver in some ways. We merely need to tickle the 
386 ISA motherboard’s timer device to generate clock pulses and gate to the interrupt control unit every 
cycle. The interrupt itself will enter the kernel’s machine-independent clock processing routine, 
hardclock(), for everything else. 


We will discuss this in detail when we describe process scheduling. We’ll also see how hardclock() 
postpones work to a software interrupt clock processing routine called softclock(), where the work does 
not degrade real-time response. 
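
To give a feel for what "tickling" the timer involves, here is a small sketch that computes the reload value for the PC's programmable timer. The input frequency of 1,193,182 Hz is the standard PC timer clock; the function names are ours, and the actual port writes are left out so the fragment can run anywhere.

```c
#include <stdint.h>

#define TIMER_FREQ 1193182u   /* input clock of the PC's 8253/8254 timer, in Hz */

struct timer_load {
    uint8_t lo;   /* low byte of the counter reload value  */
    uint8_t hi;   /* high byte of the counter reload value */
};

/* Reload value that makes the counter fire about `hz` times per second.
 * Valid for hz from 19 up to TIMER_FREQ, so the result fits in 16 bits. */
uint16_t timer_divisor(unsigned hz)
{
    return (uint16_t)((TIMER_FREQ + hz / 2) / hz);   /* rounded division */
}

/* Split the reload value into the two bytes written to the counter. */
struct timer_load timer_split(uint16_t divisor)
{
    struct timer_load t = { (uint8_t)(divisor & 0xff), (uint8_t)(divisor >> 8) };
    return t;
}
```

For a 100-Hz rescheduling clock, timer_divisor(100) yields 11932, which a driver would write to the counter as the bytes 0x9c and 0x2e.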





Driver Operations 


To get the feel of how the system uses drivers, we need to look at the functional interfaces between 
386BSD and its drivers from the perspective of the system. Note that devices may fit into one of many 
different arrangements when interacting with the system: 

Configuration. 

During system device autoconfiguration, the system probe()s for the existence of a device and attempts to 
attach() it to the system for possible use via the driver. If the device has subdevices, it attaches each of 
these slave()s successively. Because each device/driver combination takes up resources that may interfere 
or interact with other resources, it's the responsibility of the driver(s) to resolve conflicts. 
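
The probe-then-attach sequence might be sketched like this. The structures and the configure() loop are simplified stand-ins, not the 386BSD declarations, and demo_driver is a made-up driver that "finds" hardware only at one port.

```c
#include <stddef.h>

struct device;

struct driver {
    const char *name;
    int  (*probe)(struct device *);    /* nonzero if the hardware answers */
    void (*attach)(struct device *);   /* claim the device for the system */
};

struct device {
    const struct driver *drv;
    int iobase;    /* where to look for the device */
    int alive;     /* set once attached */
};

/* Walk the configuration table: probe each entry and attach those found. */
int configure(struct device *tab, size_t n)
{
    int found = 0;
    for (size_t i = 0; i < n; i++) {
        if (tab[i].drv->probe(&tab[i])) {
            tab[i].drv->attach(&tab[i]);
            found++;
        }
    }
    return found;
}

/* Stand-in driver for illustration: "finds" hardware only at port 0x1f0. */
static int  demo_probe(struct device *d)  { return d->iobase == 0x1f0; }
static void demo_attach(struct device *d) { d->alive = 1; }
const struct driver demo_driver = { "demo", demo_probe, demo_attach };
```

Entries whose probe fails are simply skipped, so one table can describe every device the kernel might meet.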

Normal Devices. 

Most devices are accessed as files so the system can interact with them. During normal use, one needs to 
open() the device, read() or write() to exchange data with it, select operating modes via ioctl(), and 
ultimately close() the device. From the perspective of the system, these events satisfy a given need and do 
not necessarily completely resemble the semantics of ordinary UNIX files. They are similar, but far from 
exact; that's why they are called "special files." 

For example, driver writers are often surprised when they try to use device open and close routines to 
increment and decrement a reference count, respectively. Many UNIX implementations don’t preserve a 
one-to-one relationship here; the "closes," for instance, can outnumber the "opens" by a fair amount (as 
with a disk driver). The reason is that the kernel may view the device through many aliases. The point of 
the open/close routines is to present the semantics of what should be done to the device to put it in a 
consistent state for the requested action. Another surprise is that some drivers only read/write in units of 
integral record size, because they operate with the restrictions of the device underlying the driver. (For 
example, the hard disk controller only operates with integral sector size records.) 
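
As an illustration of the record-size restriction, a raw disk read/write routine might reject misaligned requests along these lines. The 512-byte sector size and the raw_check() name are assumptions made for the sketch.

```c
#include <errno.h>
#include <stddef.h>

#define SECTOR_SIZE 512   /* assumed sector size for the sketch */

/* Accept a raw transfer only if it starts on a sector boundary and
 * covers a whole number of sectors; otherwise report EINVAL. */
int raw_check(unsigned long offset, size_t count)
{
    if (offset % SECTOR_SIZE != 0 || count % SECTOR_SIZE != 0)
        return EINVAL;
    return 0;
}
```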

Just as much of the file paradigm matches the given device, so should the open/close/read/write 
mechanisms fit nicely with a record-oriented device, even if it’s just a unit record device. Operating mode 
shifts should be a function of different driver filename, raw/block device partitions, or ioctl modes, so that 
they correspond to a different view or organization of the same information. 

XXintr(). 

Devices with interrupts need a means of asynchronously informing the device driver of the event. 
Typically, this involves using as small a routine as possible to minimize time spent with interrupts masked 
out. The interrupt-driven portion of the driver is the "bottom," or lower part; portions that run on the kernel 
stack of the process that has the driver open are the "top," or upper part. In common use, the top part 
initiates a device operation which causes one to n interrupts to ultimately occur. The top portion then 
sleep()s, and eventually the interrupt routine supplies a wakeup() to allow the top half to finish processing 
the request. 
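
The division of labor between the two halves can be mimicked in ordinary C. This is only a simulation under our own assumptions: xx_softc and xx_start() are invented names, and a counter stands in for the kernel's sleep()/wakeup() pair.

```c
struct xx_softc {
    int busy;         /* an operation is outstanding */
    int intrs_left;   /* interrupts still expected for it */
    int wakeups;      /* counts stand-in wakeup() calls */
};

/* Top half: start an operation expected to raise n interrupts (n >= 1).
 * A real driver would now sleep() until the bottom half wakes it. */
void xx_start(struct xx_softc *sc, int n)
{
    sc->busy = 1;
    sc->intrs_left = n;
}

/* Bottom half: called once per device interrupt; kept deliberately short. */
void xxintr(struct xx_softc *sc)
{
    if (!sc->busy)
        return;                /* stray interrupt, nothing pending */
    if (--sc->intrs_left == 0) {
        sc->busy = 0;
        sc->wakeups++;         /* stand-in for wakeup(): top half may finish */
    }
}
```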





Special Use. 

Beyond the more obvious device driver entry points, other operations may be less clear. 

To provide a means for loading/unloading the block buffer cache used in implementing the complex 
UNIX file system, the strategy() routine of mass storage device drivers encapsulates the methodology for 
bounds-checking the requested transfer. strategy() first enqueues the transfer, then starts the I/O request 
at the head of the queue. Thus, it is possible to sort the queue so the resulting transactions are conducted in an 
order that minimizes a disk drive’s head movement (thus reducing the time spent seeking on the disk). 
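
A much-simplified sketch of sorted queueing follows. It keeps the lead request where it is (it may already be in progress) and orders the rest by ascending cylinder. The kernel's own sorting routine is more elaborate; the names buf, bufq, and q_insert() here are ours.

```c
#include <stddef.h>

struct buf {
    long        cyl;    /* target cylinder of this transfer */
    struct buf *next;
};

struct bufq {
    struct buf *head;   /* lead request: the one being (or about to be) serviced */
};

/* Insert bp behind the lead request, keeping the waiting requests in
 * ascending cylinder order so the heads sweep in one direction. */
void q_insert(struct bufq *q, struct buf *bp)
{
    struct buf *p;

    bp->next = NULL;
    if (q->head == NULL) {
        q->head = bp;
        return;
    }
    p = q->head;
    while (p->next != NULL && p->next->cyl <= bp->cyl)
        p = p->next;
    bp->next = p->next;
    p->next = bp;
}
```

Queueing requests for cylinders 50, 30, 10, and 20 in that order leaves the queue as 50, 10, 20, 30.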

Two other driver entry points are of note to disk device drivers: psize(), which returns the size (in blocks) 
of a partition that the system uses to dynamically determine the size of swap space; and dump(), which 
saves a snapshot of physical memory on a partition of the disk (usually the swap space) if the system 
crashes, so we can find the cause of the crash. 

With all special devices, the select() routine provides the inner primitives of the 386BSD select() system 
call that scans for activity on a file. mmap() manages to map, via the mmap() system call, the device’s I/O 
memory address space into a portion of the calling process’s address space. Although this is the common 
way a user process (such as an X server) maps in a bitmap display’s frame buffer, it can also be used to 
map in other kinds of device memory for direct manipulation by a user process (such as the shared 
memory of a DSP chip). 

Network Devices. 

Network devices interface to the system differently from other devices. They either send or receive a 
packet of information from a computer network. The packets don’t go directly to a process, but instead 
interact with the protocol modules that implement the necessary processing of a packet. In a sense, the 
model here is more akin to "stimulus-response" than the file abstraction of special devices. It may be that 
protocol processing bounces the incoming message out again without its ever reaching a user process (as is the 
case with a redirected packet), or coalesces it with another (as is the case with a transport layer segment), 
or drops it (in the case of a redundant or corrupted packet). 

These protocol modules are internal to the kernel, and they process the link-level packets that the packet 
drivers send/receive. The network packet driver is concerned with recognizing which kind of link or 
network-level protocol the packet should be sent to, as well as how to address packets from these different 
protocols to the link address of the destination host. As with Ethernet, thousands of network protocols can 
simultaneously use the same wire without interfering with each other. 

These devices have no filename associated with them; instead, they have a name built into the driver itself. 
A network driver is configured for operation by passing it parameters (such as its address) via an init() 
entry point. (It does this by attaching itself into the protocol.) From that point on, protocols may choose to 
direct outbound packets to the device solely on the basis of address. (That is, interfaces are selected not 
by name but by their ability to reach other machines.) Such packets are passed via the output() routine, 
which wraps a packet from a given protocol into a form suited to the device’s requirements, tacks on the 
appropriate link-layer address, and enqueues the transfer. Because many device classes do this in an 
identical fashion, a support function is available in the kernel (in the case of Ethernet devices, 
ether_output()) that implements the common code. The lead packet on the queue is then passed to the 
driver’s start() routine, which passes the packet to the device and reclaims the temporary storage assigned 
to the packet. 


Packet reception is a simpler process. Upon interrupt, a packet is extracted from the device in the interrupt 
routine, placed in a freshly allocated portion of temporary storage, and matched to a given protocol by 
means of its incoming address and form. It is then enqueued on the input queue of the appropriate 
protocol, where it will be processed after the conclusion of all remaining interrupt-level code. Because this 
processing is common to many drivers, we also have an ether_input() routine for Ethernet drivers (like 
the common output routine) so they can share it. 
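
The "matching to a protocol" step can be sketched for Ethernet, where the two-byte type field at offset 12 of the frame selects the protocol. The queue names and the ether_classify() function are hypothetical; only the type values for IP (0x0800) and ARP (0x0806) are standard.

```c
#include <stddef.h>
#include <stdint.h>

enum proto_queue { Q_DROP, Q_IP, Q_ARP };

/* Choose the protocol input queue for a received Ethernet frame,
 * or Q_DROP if the frame is too short or the type is unknown. */
enum proto_queue ether_classify(const uint8_t *frame, size_t len)
{
    uint16_t type;

    if (len < 14)                  /* shorter than an Ethernet header */
        return Q_DROP;
    type = (uint16_t)((frame[12] << 8) | frame[13]);   /* network byte order */
    switch (type) {
    case 0x0800: return Q_IP;
    case 0x0806: return Q_ARP;
    default:     return Q_DROP;
    }
}
```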

To gain access to the device while it is operating or to change operating modes, fetch statistics, and so 
forth, an ioctl() entry point allows utility programs to manipulate the device (by name). 

Next Month 

Many devices work on an interrupt-driven basis — they signal an asynchronous event by generating an 
exception, which tells the processor to come and service them. To support this need, we must have the 
ability to enter, exit, and mask various processor interrupts. This topic is fairly complex, deserving 
detailed discussion, and will be taken up next month. 

[LISTING ONE] 

/* [Excerpted from /sys/i386/include/param.h] */

#ifndef _ORPL_

/* Interrupt Group Masks */
extern u_short _highmask_;      /* interrupts masked with splhigh() */
extern u_short _ttymask_;       /* interrupts masked with spltty() */
extern u_short _biomask_;       /* interrupts masked with splbio() */
extern u_short _netmask_;       /* interrupts masked with splimp() */
extern u_short _protomask_;     /* interrupts masked with splnet() */
extern u_short _nonemask_;      /* interrupts masked with splnone() */

asm(" .set IO_ICU1, 0x20 ; .set IO_ICU2, 0xa0 ");

/* adjust priority level to disable a group of interrupts */
#define _ORPL_(m) ({ u_short oldpl, msk; \
    msk = (msk); \
    asm volatile (" \
    cli ;                       /* modify interrupts atomically */ \
    movw %1, %%dx ;             /* get mask to OR in */ \
    inb $ IO_ICU1+1, %%al ;     /* get low order mask */ \
    xchgb %%dl, %%al ;          /* switch the old with the new */ \
    orb %%dl, %%al ;            /* finally, OR it in! */ \
    outb %%al, $ IO_ICU1+1 ;    /* and stuff it back where it came */ \
    inb $ 0x84, %%al ;          /* post it & handle write recovery */ \
    inb $ IO_ICU2+1, %%al ;     /* next, get high order mask */ \
    xchgb %%dh, %%al ;          /* switch the old with the new */ \
    orb %%dh, %%al ;            /* finally, OR it in! */ \
    outb %%al, $ IO_ICU2+1 ;    /* and stuff it back where it came */ \
    inb $ 0x84, %%al ;          /* post it & handle write recovery */ \
    movw %%dx, %0 ;             /* return old mask */ \
    sti                         /* allow interrupts again */ " \
        : "&=g" (oldpl)         /* return values */ \
        : "g" ((m))             /* arguments */ \
        : "ax", "dx"            /* registers used */ \
    ); \
    oldpl;                      /* return the "old" value */ \
})

/* force priority mask to a set value */
#define _SETPL_(m) ({ u_short oldpl, msk; \
    msk = (msk); \
    asm volatile (" \
    cli ;                       /* modify interrupts atomically */ \
    movw %1, %%dx ;             /* get mask to OR in */ \
    inb $ IO_ICU1+1, %%al ;     /* get low order mask */ \
    xchgb %%dl, %%al ;          /* switch the old with the new */ \
    outb %%al, $ IO_ICU1+1 ;    /* and stuff it back where it came */ \
    inb $ 0x84, %%al ;          /* post it & handle write recovery */ \
    inb $ IO_ICU2+1, %%al ;     /* next, get high order mask */ \
    xchgb %%dh, %%al ;          /* switch the old with the new */ \
    outb %%al, $ IO_ICU2+1 ;    /* and stuff it back where it came */ \
    inb $ 0x84, %%al ;          /* post it & handle write recovery */ \
    movw %%dx, %0 ;             /* return old mask */ \
    sti                         /* allow interrupts again */ " \
        : "&=g" (oldpl)         /* return values */ \
        : "g" ((m))             /* arguments */ \
        : "ax", "dx"            /* registers used */ \
    ); \
    oldpl;                      /* return the "old" value */ \
})

#define splhigh()       _ORPL_(_highmask_)
#define spltty()        _ORPL_(_ttymask_)
#define splbio()        _ORPL_(_biomask_)
#define splimp()        _ORPL_(_netmask_)
#define splnet()        _ORPL_(_protomask_)
#define splsoftclock()  _ORPL_(_protomask_)

#define splx(v) ({ u_short val; \
    val = (v); \
    if (val == _nonemask_) (void) spl0();   /* zero is special */ \
    else (void) _SETPL_(val); \
})

#endif /* _ORPL_ */


[LISTING TWO] 

Listing Two is currently unavailable 

Porting Unix To The 386: Device Drivers: Entering, exiting, and 
masking processor interrupts 

Interfaces to the operating system. Entering, exiting and masking processor interrupts. 





Entering, exiting, and masking processor interrupts 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, the first virtual memory microprocessor-based UNIX system. Prior to 
establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric 
Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or 
comments to lynne@berkeley.edu. (c) 1992 TeleMuse. 

Many devices work on an "interrupt-driven" basis. That is to say, they signal an asynchronous event by 
generating an exception, which tells the processor to come and service them. To support this need, we 
must have the ability to enter, exit, and mask various processor interrupts. This is fairly complex, and 
merits detailed discussion. 

In the original Bell Labs PDP-11/45 UNIX system, classes of devices shared a particular level of interrupt 
priority (although each had a distinct interrupt vector). For example, priority level 5 was associated with 
terminal controllers, and priority 6 was associated with mass storage and timeslice clock devices. In 
addition, the PDP-11/45 had an spl ("set processor level") instruction that forced the processor itself to the 
given level on demand. This spl mechanism was used to temporarily raise the priority level when critical 
sections of code would be executed. These critical sections would alter the variables or hardware devices 
upon which the interrupt service routines depended, so the affected interrupts would themselves be 
temporarily "locked out" while the world was changed. This procedure prevented inconsistencies arising 
from a race condition between the interrupt code and the critical sections. A typical case is shown in 
Example 1. 

Example 1: The spl function 


    saved_priority = spl5();
    ...                         /* critical section */
    splx(saved_priority);


All spl() functions returned the previous or "old" priority level so that it could be restored by a subsequent 
call to the splx() function, which would then force the priority indicated by its function parameter. 
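
The save-and-restore convention can be tried out with a purely simulated processor level. Nothing below touches hardware: the sim_ names are ours, and an integer stands in for the PDP-11's processor priority, where a device interrupts only if its level is higher than the processor's.

```c
static int cur_level;   /* simulated processor priority; 0 allows everything */

/* Set the simulated priority level and hand back the previous one. */
int sim_spl(int level)
{
    int old = cur_level;
    cur_level = level;
    return old;
}

#define sim_spl5()    sim_spl(5)
#define sim_splx(s)   ((void) sim_spl(s))

/* A device interrupts only when its level exceeds the processor's. */
int sim_intr_allowed(int dev_level)
{
    return dev_level > cur_level;
}
```

Between sim_spl5() and the matching sim_splx(), a level-5 terminal interrupt is held off while a level-6 disk or clock interrupt still gets through.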

Most ports of UNIX attempted to preserve or emulate this mechanism, frequently by calling the routines 
that manage spl(). In 4.3BSD, the routines were separated out into more descriptive and less enigmatic 
names. Thus, splclock(), for example, defeated all timeslice clock interrupts, and splbio() defeated mass 
storage devices that depended on the block I/O (or bio for short) mechanism. 


386 ISA Interrupt Mechanism in Detail 


Our 386 ISA bus machine uses a different arrangement from those outlined above (see the accompanying 
figure). Unlike the PDP-11, VAX, and 68000, the interrupt priority control is not part of the processor itself. 
Instead, it’s actually part of the ISA bus. The 386 relies on two 8259 Interrupt Control Units (ICUs) attached 
to the processor, just like an I/O device. So, instead of having special instructions or registers in the 
processor to manage interrupts, the 386 uses I/O instructions on dedicated I/O ports. Additional signals to the 
microprocessor allow the ICUs to indicate the presence of an interrupt and pass back information on 
which interrupt to dispatch to. These ICUs function as a programmable filter of device interrupts, and the 
386 can process these interrupts as presented or ignore them as a whole. 

The ICUs are attached in a cascaded arrangement, with the master ICU directly connected to the 386 and 
the slave ICU connected to one of the master’s eight interrupt input lines (IRQ2). Because of this 
layout, although we have two ICUs with eight lines apiece, only 15 interrupts are actually generatable (see 
Listing One, page 90), as the third interrupt (IRQ2) is not allowed. Even more confusing is the arrangement 
of relative priority: the priorities of the slave interrupts (IRQ8-15) are jammed in between IRQ1 
and IRQ3. And finally, to maintain compatibility with the original PC, what used to be IRQ2 is now 
attached to the slave as IRQ9, with the newer interrupt signals on the slave ICU (other than IRQ9) 
available only to the AT or 16-bit wide cards. 
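
The resulting priority order is easy to get wrong, so a small sketch may help. Assuming the fixed-priority arrangement just described, this hypothetical helper ranks each usable IRQ from 0 (highest priority) to 14 (lowest).

```c
/* Rank of an ISA interrupt in the cascaded, fixed-priority arrangement:
 * IRQ0, IRQ1, then IRQ8-15 (the slave, hanging off IRQ2), then IRQ3-7.
 * Returns -1 for IRQ2 (taken by the cascade) and for invalid numbers. */
int irq_rank(int irq)
{
    if (irq < 0 || irq > 15 || irq == 2)
        return -1;
    if (irq < 2)
        return irq;         /* IRQ0 -> 0, IRQ1 -> 1 */
    if (irq >= 8)
        return irq - 6;     /* IRQ8 -> 2 ... IRQ15 -> 9 */
    return irq + 7;         /* IRQ3 -> 10 ... IRQ7 -> 14 */
}
```

So the hard disk on IRQ14 (rank 8) outranks a serial port on IRQ4 (rank 11), even though its number is larger.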

Each ICU has three registers full of bits (a bit corresponds to each interrupt) representing its device 
interrupt lines: request, "in service," and mask. Sampled interrupts are present in the request register, and 
interrupts that have been signaled to the processor are recorded in the "in service" register. The mask 
register is managed by the software to selectively disable interrupts as needed — to take them out of the 
running for interrupting the processor. 

Unlike the PDP-11, each interrupt vector has its own interrupt priority level, discrete from all others. Also, 
on the ISA there is no partitioning of priority levels and class of devices. (For example, mass storage 
devices, terminals, and network devices are all spread out and interspersed.) The choice of interrupt vector 
frequently is a function of the given combination of cards present in a system. 

Interrupt Group Masks 

The ICUs provide for quite a variety of programmable modes of operation, as befits a part primarily 
designed for real-time process control applications. We chose to use it in "fixed" priority mode and 
implement our own priority mechanism in software. For this purpose, we created bit mask values that we 
could shove into the ICU mask registers to defeat interrupts associated with each group. Thus, for 
example, when spltty() is called to block the class of terminal device interrupts, the group mask 
_ttymask_, which has a bit set for each terminal device interrupt to be defeated, is placed in the ICU 
interrupt mask register. This is quite different from the original PDP-11 arrangement! 

A nested condition, where multiple different spls could be called, is a possibility. So, the semantics of the 
spl routines have been altered slightly to be the OR of all masks posted instead of assignment. Thus, if a 
sequence of code called both spltty() and splbio(), the effect would be to defeat both groups of devices 
inclusively. There are two exceptions to this: (1) when we wish to unmask all valid interrupts with 
splnone() (that is, block none of the interrupts); and (2) when we wish to restore the previous state of 
the interrupt control unit’s masks with splx(oldmask). Both of these exceptions need to assign the mask, 
instead of just ORing a value to obtain the desired effect. 
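
The OR-versus-assign rule can be tried out in ordinary user-level C. The following is only a sketch: the mask lives in a plain variable rather than in the ICU, and the names orpl(), setpl(), the *_sim wrappers, and the TTYMASK/BIOMASK values are made up for illustration (real group masks come out of autoconfiguration).

```c
#include <stdint.h>

/* Simulated 16-bit ICU mask: a set bit means that IRQ is blocked. */
static uint16_t icu_mask;

/* Hypothetical group masks, chosen only for this example. */
enum { TTYMASK = 0x0018, BIOMASK = 0x4040, NONEMASK = 0x0000 };

/* OR a group mask into the current mask; return the previous mask. */
static uint16_t orpl(uint16_t group)
{
    uint16_t old = icu_mask;
    icu_mask |= group;
    return old;
}

/* Assign the mask outright; return the previous mask. */
static uint16_t setpl(uint16_t value)
{
    uint16_t old = icu_mask;
    icu_mask = value;
    return old;
}

static uint16_t spltty_sim(void)      { return orpl(TTYMASK); }
static uint16_t splbio_sim(void)      { return orpl(BIOMASK); }
static void     splx_sim(uint16_t s)  { (void)setpl(s); }
static void     splnone_sim(void)     { (void)setpl(NONEMASK); }
```

Calling spltty_sim() and then splbio_sim() leaves both groups blocked, and handing each saved value back to splx_sim() in reverse order unwinds the nesting one level at a time.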

These group masks (see Listing Two, page 90) must be constructed from the interrupts of the devices found 
during the autoconfiguration phase. (See DDJ, November 1991.) Lists of devices to be configured for each 
group (such as the tty) are collected. If the given device is found and attached, the interrupt determined to 
be associated with this device is added to the appropriate group mask (_ttymask_ for the tty group, for 
example). Similarly, a mask of all valid interrupt devices is maintained, so that unused interrupts are 
never enabled. 
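
The mask arithmetic is easy to get backwards, so here is a small stand-alone sketch of it. It is not the kernel code: struct sim_dev, struct sim_masks, and sim_configure() are hypothetical names, and the "present" flag stands in for a real probe routine.

```c
#include <stddef.h>
#include <stdint.h>

struct sim_dev {
    const char *name;
    uint16_t    irq_bit;   /* one-hot IRQ bit, 0 if the device has none */
    int         present;   /* stand-in for the result of a probe */
};

struct sim_masks {
    uint16_t group;     /* interrupts blocked when this group is raised */
    uint16_t disabled;  /* interrupts never enabled (starts as all ones) */
};

/* Walk one group's device list, building its mask as devices are found. */
static void sim_configure(const struct sim_dev *devs, size_t n,
                          struct sim_masks *m)
{
    for (size_t i = 0; i < n; i++) {
        if (!devs[i].present || devs[i].irq_bit == 0)
            continue;
        m->disabled &= (uint16_t)~devs[i].irq_bit;  /* enable this IRQ */
        m->group    |= devs[i].irq_bit;             /* add it to the group */
    }
    /* Unused interrupts stay masked no matter which group is raised. */
    m->group |= m->disabled;
}
```

With the timer already enabled (disabled = 0xfffe) and two devices found on the IRQ bits 0x0010 and 0x0080, the group mask comes out as 0xfffe: everything is blocked except the timer, which belongs to no group here.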


To manipulate the interrupt masks (see Listing Three, page 90) we use the spl() functions described 
earlier. They are implemented as GCC inline assembler macros, so that they may be called with 
the smallest additional overhead. As you might guess from this, the spl() functions are called frequently, 
and shortening execution is of value. Please note that two different versions of these macros are provided, 
for use with assembler files and C files, with different calling conventions, again to minimize overhead. 


These functions disable the processor from handling interrupts (cli) so that the mask can appear to change 
"atomically." Also, in deference to high-speed 386 systems, what appear to be spurious input instructions 
are added after updating the ICU mask register. These instructions do a read of a known "nonexistent" 
port. It is known that no data will be forced on the bus, and that any outstanding output operations on the 
ISA bus will have been written out before this instruction finishes. This code is necessary to mitigate the 
sins of "clever" hardware that holds port output contents in a "write buffer" so that output instructions can 
overlap execution (and it’s still quite slow). If this code is not inserted, when we turn on the processor’s 
interrupt processing again (sti), the new mask won’t have made it to the ICU and we will be running at the 
"old" priority for "a while." If this happens at an inconvenient time (during an interrupt processing routine, 
for instance), we might endlessly recurse and overrun the processor. The 386 is very unforgiving in this 
regard: the processor will shut down and spontaneously reset itself. Equally cleverly, the BIOS will then 
scrub the screen buffer and memory clean of all evidence of what led to the disaster, which makes for a 
debugging nightmare. 
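
A toy model may make the hazard easier to see. Nothing below touches real hardware: the "ICU register" and the single-entry "write buffer" are ordinary variables, and sim_outb() and sim_inb_dummy() are invented names. The model only assumes what the text describes, namely that an output can sit in a buffer until a following input forces it out.

```c
#include <stdint.h>

static uint8_t icu_mask_hw;     /* what the simulated ICU actually holds */
static uint8_t posted_value;    /* a write still sitting in the buffer */
static int     posted_pending;  /* nonzero while that write is outstanding */

/* Port write: lands in the write buffer, not yet in the ICU. */
static void sim_outb(uint8_t value)
{
    posted_value   = value;
    posted_pending = 1;
}

/* Port read of an unused port: earlier writes drain before it completes. */
static uint8_t sim_inb_dummy(void)
{
    if (posted_pending) {
        icu_mask_hw    = posted_value;
        posted_pending = 0;
    }
    return 0;  /* the value read is ignored by the caller */
}
```

After sim_outb(0x18) alone, icu_mask_hw still holds the old mask; only after sim_inb_dummy() does it read 0x18. Re-enabling interrupts between those two steps is the window the dummy read is there to close.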


One technique commonly used by PC software developers is to attempt to memorize the contents of the 
screen prior to a shutdown. It’s amazing what one can remember during the 100 milliseconds or so that the 
chip is off doing diagnostics. During particularly baffling sessions, different colors on console messages 
and speaker tones were used to convey the content of debugging information, because one could more 
quickly grasp a sequence of tones or colors than the text of a message. This technique, while many times a 
lifesaver, was somewhat difficult to explain to a passerby when the system would hit one of these 
debugging sequences. To the uninitiated, the computer would seem to have gone mad, pouring out 
messages and tones at a frightful rate. In addition, such a person would never quite believe that a 
conclusive answer could be obtained from such an outpouring, because the text would usually be 
incomprehensible, although the arrangement of color and tones would be unique enough to finger the 
guilty party. One might consider it a "poor man’s" in-circuit emulator of sorts, although perfect pitch is not 
usually considered a requirement for systems programming! 


Wiring Interrupts to Drivers 

Our operating system drivers are written for the most part in C, thus leveraging all the benefits of 
programming in a high-level language, such as portability and readability. However, we need to somehow 
connect a device driver’s interrupt function (written in C) to the corresponding hardware interrupt. This 
correspondence might change depending on the whims of configuration, so this process is tied up (like the 
Gordian Knot) to autoconfiguration. 

On the 386, we use a two-step operation. In processing an interrupt, the 386 indexes the IDT (Interrupt 
Descriptor Table) to obtain the address of the interrupt entry stub routine. It then executes the stub to gain 
entry to the given driver. We call this routine a "stub" because its whole existence is dedicated to 
providing a rendezvous point between the machine’s concept of an interrupt and our operating system 
kernel’s higher-level concept of an interrupt (that is, an exception handled by a C function). 

One way to understand this process is to view it more conceptually. The 386 processes an interrupt by 
inserting a "hidden" call instruction before the next consecutive instruction it would have processed 
otherwise. This hidden call is an lcall or intersegment call to the function described by the interrupt’s IDT 
entry. The stub routine adds the necessary semantics (or "glue") to expand this hidden call into a hidden 
driver-interrupt call, and allows the effect to be reversed and the stack cleaned off when the driver 
interrupt routine returns. 

To allow the 386 processor to find the interrupt stubs, we need to fill out the Interrupt Descriptor Table 
entry associated with each interrupt. The first 32 entries are used by the processor to deal with its own 
exceptions, so when we initialize the ICUs, we program them to start their vectors above this range. So IRQ0-7 
from the first ICU correspond to IDT entries 32-39, and IRQ8-15 from the second ICU correspond to IDT 
entries 40-47. (Note: We don’t receive an interrupt on the third interrupt of the primary ICU because it is 
used as the cascade input by the second ICU. However, it still requires an interrupt table entry [34] as a 
placeholder, even though it is never used.) The function setidt() (see Listing Four, page 91) fills out an 
IDT entry with the required details. 
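
The IRQ-to-IDT arithmetic can be checked in a few lines of C. This is a sketch under two assumptions: the IRQ is given as a one-hot mask bit in the style of Listing One, and the base offset is 32 as stated above. SIM_ICU_OFFSET and irq_mask_to_idt() are illustrative names, not the kernel's.

```c
#include <strings.h>   /* ffs() */

/* First IDT slot after the 32 entries reserved for processor exceptions. */
#define SIM_ICU_OFFSET 32

/* Convert a one-hot IRQ mask bit to the IDT index its stub occupies. */
static int irq_mask_to_idt(int irq_mask)
{
    int intrno = ffs(irq_mask) - 1;   /* bit position equals IRQ number */
    return SIM_ICU_OFFSET + intrno;
}
```

For example, the cascade input (mask 0x0004) maps to the placeholder entry 34, and the first interrupt of the second ICU (mask 0x0100) maps to entry 40.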

The Interrupt Descriptor Table Entry 

Interrupts on the 386 are extremely CISC-ish; they can be directed to different ring levels (including user 
process ring 3), and can even context switch to another process! No other processor we know of has such 
an elaborate mechanism for interrupt entry. All details needed in the contents of the interrupt descriptor 
(see Listing Five, page 91) reflect this wealth of flexibility built into the 386. The interrupt descriptor can 
be a task gate, an interrupt gate, or a trap gate. Accordingly, the entry can be directed to a separate process, 
an interrupt entry stub (with interrupts disabled), or a trap entry stub (with interrupt status unchanged). 
Because all these choices are gates (MULTICS terminology), embedded in the interrupt descriptor is the 
selector of another segment, described in the GDT/LDT tables. Along with the selector, there is an offset into 
the selected segment, which locates the entry stub within that segment. For our needs, the selector 
is always the kernel’s code selector, because with our current arrangement the only place we can deal with 
interrupts is in the kernel. The offset corresponds to the kernel virtual address of the interrupt (or trap) 
entry stub. 
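
To see how the selector and the split offset fit together, the sketch below packs a gate into two 32-bit words with explicit shifts instead of bit-fields. It follows the field order of Listing Five, but struct sim_gate and sim_make_gate() are hypothetical, and the selector value 0x08 and type value 14 (the 386 interrupt gate encoding) used in the example are assumptions made for illustration.

```c
#include <stdint.h>

struct sim_gate {
    uint32_t lo;   /* offset 15..0 in the low half, selector in the high half */
    uint32_t hi;   /* stack-copy count, type, dpl, present, offset 31..16 */
};

static struct sim_gate sim_make_gate(uint32_t offset, uint16_t selector,
                                     unsigned type, unsigned dpl)
{
    struct sim_gate g;

    g.lo = ((uint32_t)selector << 16) | (offset & 0xffffu);
    g.hi = (offset & 0xffff0000u)       /* offset bits 31..16 */
         | (1u << 15)                   /* present bit */
         | ((dpl & 3u) << 13)           /* 2-bit privilege level */
         | ((type & 0x1fu) << 8);       /* 5-bit type; stack-copy count is 0 */
    return g;
}
```

Packing a stub at address 0xfe012345 with selector 0x08, type 14, and dpl 0 gives the words 0x00082345 and 0xfe018e00.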

An additional field to note in the interrupt descriptor entry is the descriptor privilege level. It is used to 
allow or disallow 386 INT (software interrupt) instructions to gain entry to this interrupt descriptor, as if 
an interrupt had been processed from hardware. On the earlier 8088/8086 (the original PC), these software 
interrupts were indistinguishable from hardware interrupts. In fact, many of the interrupts that MS-DOS 
uses conflict with 386 and ICU IDT assignments. (Luckily for MS-DOS, this is only a problem for 
protected mode and paging.) To prevent software interrupt instructions from user mode processes from 
executing device driver code, the descriptor privilege level (dpl) field in the IDT sets the maximum ring 
level allowed to execute this IDT entry with a given INT instruction. 
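
The rule amounts to a single comparison. The helper below is only an illustration (sim_gate_entry_allowed() is an invented name): a hardware interrupt always gets through, while a software INT is admitted only if the current ring number does not exceed the gate's dpl.

```c
/*
 * Returns 1 if entry through the gate is permitted, 0 if the processor
 * would refuse it.  cpl is the ring the code is running in (0 = kernel,
 * 3 = user); gate_dpl is the dpl field of the IDT entry.
 */
static int sim_gate_entry_allowed(int from_hardware, unsigned cpl,
                                  unsigned gate_dpl)
{
    if (from_hardware)
        return 1;              /* the dpl check applies to INT only */
    return cpl <= gate_dpl;
}
```

With the device entries set to dpl 0, an INT issued from ring 3 is refused, while the same vector raised by the ICU is accepted.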

[LISTING ONE] 

/* [Excerpted from /sys/i386/isa/icu.h] */ 

/* Interrupt enable bits — in order of priority */ 

#define IRQ0            0x0001          /* highest priority - timer */
#define IRQ1            0x0002
#define IRQ_SLAVE       0x0004
#define IRQ8            0x0100
#define IRQ9            0x0200
#define IRQ2            IRQ9
#define IRQ10           0x0400
#define IRQ11           0x0800
#define IRQ12           0x1000
#define IRQ13           0x2000
#define IRQ14           0x4000
#define IRQ15           0x8000
#define IRQ3            0x0008
#define IRQ4            0x0010
#define IRQ5            0x0020
#define IRQ6            0x0040
#define IRQ7            0x0080          /* lowest - parallel printer */


[LISTING TWO] 


/* [Excerpted from /sys/i386/isa/icu.h] */ 

/* Interrupt "level" mechanism variables, masks, and macros */ 
#define INTREN(s)       _nonemask_ &= ~(s)
#define INTRDIS(s)      _nonemask_ |= (s)
#define INTRMASK(msk,s) msk |= (s)

/* [Excerpted from /sys/i386/isa/isa.c] */ 

/* Configure all ISA devices */ 
isa_configure() { 
        struct isa_device *dvp; 
        struct isa_driver *dp; 

        splhigh(); 
        INTREN(IRQ_SLAVE); 

        /* configure devices, constructing group masks as we go */ 
        for (dvp = isa_devtab_bio; config_isadev(dvp,&_biomask_); dvp++); 
        for (dvp = isa_devtab_tty; config_isadev(dvp,&_ttymask_); dvp++); 
        for (dvp = isa_devtab_net; config_isadev(dvp,&_netmask_); dvp++); 
        for (dvp = isa_devtab_null; config_isadev(dvp,0); dvp++); 

        /* if we support slip, then any tty interrupt is a potential net intr */ 
#include "sl.h" 
#if NSL > 0 
        _netmask_ |= _ttymask_; 
        _ttymask_ |= _netmask_; 
#endif 

        /* if not enabled, don't allow in ANY mask to become enabled */ 
        _biomask_ |= _nonemask_; 
        _ttymask_ |= _nonemask_; 
        _netmask_ |= _nonemask_; 
        _protomask_ |= _nonemask_; 
        splnone(); 
} 

/* Configure an ISA device. */ 
config_isadev(isdp, mp) 
        struct isa_device *isdp; 
        int *mp; 
{ 
        struct isa_driver *dp; 

        if (dp = isdp->id_driver) { 
                /* does this device have any I/O shared memory? */ 
                if (isdp->id_maddr) { 
                        extern int atdevbase[]; 

                        /* convert from PC absolute physical to virtual */ 
                        isdp->id_maddr -= IOM_BEGIN; 
                        isdp->id_maddr += (int)&atdevbase; 
                } 

                /* "Is there anyone on board?" -- Star Trek */ 
                isdp->id_alive = (*dp->probe)(isdp); 
                if (isdp->id_alive) { 
                        printf("%s%d", dp->name, isdp->id_unit); 
                        (*dp->attach)(isdp); 
                        printf(" at 0x%x ", isdp->id_iobase); 

                        /* have we got an interrupt to wire down? */ 
                        if (isdp->id_irq) { 
                                int intrno; 

                                intrno = ffs(isdp->id_irq)-1; 
                                printf("irq %d ", intrno); 
                                INTREN(isdp->id_irq); 

                                /* add to a group mask?? */ 
                                if (mp) INTRMASK(*mp,isdp->id_irq); 

                                /* wire interrupt */ 
                                setidt(ICU_OFFSET+intrno, isdp->id_intr, 
                                        SDT_SYS386IGT, SEL_KPL); 
                        } 

                        /* perhaps a DMA channel request as well? */ 
                        if (isdp->id_drq != -1) printf("drq %d ", isdp->id_drq); 
                        printf("on isa\n"); 
                } 
                return (1); 
        } else return(0); 
} 


[LISTING THREE] 

/* [Excerpted from /sys/i386/include/param.h] */ 


#ifndef _ORPL_ 

/* Interrupt Group Masks */ 
extern u_short _highmask_;      /* interrupts masked with splhigh() */ 
extern u_short _ttymask_;       /* interrupts masked with spltty() */ 
extern u_short _biomask_;       /* interrupts masked with splbio() */ 
extern u_short _netmask_;       /* interrupts masked with splimp() */ 
extern u_short _protomask_;     /* interrupts masked with splnet() */ 
extern u_short _nonemask_;      /* interrupts masked with splnone() */ 

asm(" .set IO_ICU1, 0x20 ; .set IO_ICU2, 0xa0 "); 


/* adjust priority level to disable a group of interrupts */ 
#define _ORPL_(m) ({ u_short oldpl, msk; \
        msk = (m); \
        asm volatile (" \
        cli ;                           /* modify interrupts atomically */ \
        movw %1, %%dx ;                 /* get mask to OR in */ \
        inb $ IO_ICU1+1, %%al ;         /* get low order mask */ \
        xchgb %%dl, %%al ;              /* switch the old with the new */ \
        orb %%dl, %%al ;                /* finally, OR it in! */ \
        outb %%al, $ IO_ICU1+1 ;        /* and stuff it back where it came */ \
        inb $ 0x84, %%al ;              /* post it & handle write recovery */ \
        inb $ IO_ICU2+1, %%al ;         /* next, get high order mask */ \
        xchgb %%dh, %%al ;              /* switch the old with the new */ \
        orb %%dh, %%al ;                /* finally, OR it in! */ \
        outb %%al, $ IO_ICU2+1 ;        /* and stuff it back where it came */ \
        inb $ 0x84, %%al ;              /* post it & handle write recovery */ \
        movw %%dx, %0 ;                 /* return old mask */ \
        sti                             /* allow interrupts again */ " \
        : "&=g" (oldpl)                 /* return values */ \
        : "g" (msk)                     /* arguments */ \
        : "ax", "dx"                    /* registers used */ \
        ); \
        oldpl;                          /* return the "old" value */ \
}) 

/* force priority mask to a set value */ 
#define _SETPL_(m) ({ u_short oldpl, msk; \
        msk = (m); \
        asm volatile (" \
        cli ;                           /* modify interrupts atomically */ \
        movw %1, %%dx ;                 /* get mask to set */ \
        inb $ IO_ICU1+1, %%al ;         /* get low order mask */ \
        xchgb %%dl, %%al ;              /* switch the old with the new */ \
        outb %%al, $ IO_ICU1+1 ;        /* and stuff it back where it came */ \
        inb $ 0x84, %%al ;              /* post it & handle write recovery */ \
        inb $ IO_ICU2+1, %%al ;         /* next, get high order mask */ \
        xchgb %%dh, %%al ;              /* switch the old with the new */ \
        outb %%al, $ IO_ICU2+1 ;        /* and stuff it back where it came */ \
        inb $ 0x84, %%al ;              /* post it & handle write recovery */ \
        movw %%dx, %0 ;                 /* return old mask */ \
        sti                             /* allow interrupts again */ " \
        : "&=g" (oldpl)                 /* return values */ \
        : "g" (msk)                     /* arguments */ \
        : "ax", "dx"                    /* registers used */ \
        ); \
        oldpl;                          /* return the "old" value */ \
}) 

#define splhigh()       _ORPL_(_highmask_) 
#define spltty()        _ORPL_(_ttymask_) 
#define splbio()        _ORPL_(_biomask_) 
#define splimp()        _ORPL_(_netmask_) 
#define splnet()        _ORPL_(_protomask_) 
#define splsoftclock()  _ORPL_(_protomask_) 
#define splx(v) ({ u_short val; \
        val = (v); \
        if (val == _nonemask_) (void) spl0();   /* zero is special */ \
        else (void) _SETPL_(val); \
}) 

#endif /* _ORPL_ */ 


[LISTING FOUR] 

/* [Excerpted from /sys/i386/i386/machdep.c] */ 


/* Fill out a gate descriptor used as an interrupt vector. */ 
setidt(idx, func, typ, dpl) char *func; { 
        struct gate_descriptor *ip = idt + idx; 

        ip->gd_looffset = (int)func; 
        ip->gd_selector = GSEL(GCODE_SEL,SEL_KPL); 
        ip->gd_stkcpy = 0; 
        ip->gd_xx = 0; 
        ip->gd_type = typ; 
        ip->gd_dpl = dpl;       /* can we allow INT's to this IDT index? */ 
        ip->gd_p = 1;           /* yup, we're here, don't segment fault me */ 
        ip->gd_hioffset = ((int)func)>>16; 
} 


[LISTING FIVE] 

/* [Excerpted from /sys/i386/include/segments.h] */ 


/* Gate descriptors (e.g. indirect descriptors) 
 * [The IDT is made up of these as well.] */ 
struct gate_descriptor { 
        unsigned gd_looffset:16 ;       /* gate offset (lsb) */ 
        unsigned gd_selector:16 ;       /* gate segment selector */ 
        unsigned gd_stkcpy:5 ;          /* number of stack wds to cpy */ 
        unsigned gd_xx:3 ;              /* unused */ 
        unsigned gd_type:5 ;            /* segment type */ 
        unsigned gd_dpl:2 ;             /* segment descriptor privilege level */ 
        unsigned gd_p:1 ;               /* segment descriptor present */ 
        unsigned gd_hioffset:16 ;       /* gate offset (msb) */ 
} ; 


Figure 1: Bus architecture 

[Figure: interrupt lines on the I/O bus feed the primary ICU (#1), which drives the 386 INTR input. 
The lines are drawn between the labels "Highest Priority" and "Lowest Priority."] 

Processor motherboard devices: 
    IRQ0  - Timer (Timeslice Clock) 
    IRQ1  - Keyboard 
    IRQ8  - Battery Time of Day Clock 
    IRQ13 - Coprocessor 

XT 8-bit cards: 
    IRQ9  - (formerly IRQ2) 
    IRQ3  - <serial port 2> 
    IRQ4  - <serial port 1> 
    IRQ5  - <parallel port 2> 
    IRQ6  - <floppy disk> 
    IRQ7* - <parallel port 1> 

AT 16-bit cards have in addition: 
    IRQ10 
    IRQ11 
    IRQ12 
    IRQ14 - <hard disk> 
    IRQ15* 

* IRQ7 and IRQ15 also receive "lost" interrupts for the associated controller. 



Porting Unix To The 386: Device Drivers: Getting into and out 
of interrupt routines 

Completion of basic 386BSD device drivers. 

Getting into and out of interrupt routines 



William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, the first virtual-memory, microprocessor-based UNIX system. Prior to 
establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric 
Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or 
comments to lynne@berkeley.edu. (c) 1992 TeleMuse. 


For the last couple of months, we’ve been examining device drivers and 386BSD. We pick up where we 
left off last month, focusing on interrupt routines. 


In Listing One (page 108), we use a series of macros to implement the interrupt entry stubs, using the C 
preprocessor. Typically, for each system compiled, a configuration program creates a file of interrupt 
stub routines (see Listing Two, page 108). Each stub handles interrupt entry (at Vwd0 in Listing Two) and 
processes the interrupt to the point where it can be passed off to the driver interrupt-handler function (in this 
case, wdintr). To keep the macros succinct, they are built out of other macros. To keep the code short, it 
is implemented inline with a minimum of branches, although it could have been written as a series of 
subroutines. 


The macros ORPL() and SETPL() are nearly identical to the internal macros _ORPL_() and 
_SETPL_(), used to implement the spls. They are used for the same reasons, but in assembler, not C. 

Macro INTR() builds an interrupt entry stub per invocation. The address of these stubs is obtained from 
the IDT table entry that the 386 fetches when processing an interrupt. The first instructions executed when 
handling an interrupt are at the beginning of this macro. We begin by crafting a trap frame suitable for a 
variety of purposes that may result from this incoming interrupt, and saving the processor state so we may 
restore it when we return from the interrupt later. 

After saving the state in a trap frame, we send a command to the ICUs to dismiss the "in service" bit 
(turned on when the processor began processing the interrupt, to lock out lower-priority requests). This 
indicates to both ICUs that the interrupt is stacked and that we will allow any subsequent interrupts that 
are unmasked, regardless of priority. This is done as soon as possible to forestall interrupts that might 
otherwise be missed. 
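
The effect of dismissing the "in service" bit early can be modeled in a few lines. This is a sketch of one ICU in fixed priority mode, where a lower-numbered input has higher priority; sim_isr, sim_eoi(), and sim_can_deliver() are invented for the illustration and do not touch any port.

```c
#include <stdint.h>

static uint8_t sim_isr;   /* "in service" register of one simulated ICU */

/* End-of-interrupt command: clear the highest-priority in-service bit,
 * which is the lowest-numbered bit that is set. */
static void sim_eoi(void)
{
    sim_isr &= (uint8_t)(sim_isr - 1);
}

/* A request reaches the processor only if it is unmasked and no input of
 * equal or higher priority is still marked in service. */
static int sim_can_deliver(unsigned irq, uint8_t mask)
{
    uint8_t bit             = (uint8_t)(1u << irq);
    uint8_t higher_or_equal = (uint8_t)(bit | (bit - 1));

    if (mask & bit)
        return 0;
    return (sim_isr & higher_or_equal) == 0;
}
```

While input 3 is in service, input 6 is locked out even though it is unmasked. Once sim_eoi() has run, only the mask register decides what gets through, which is what lets the software group masks take over the priority decision.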


Next, the segment registers the kernel will use are saved and loaded with known values. This matters because 
the interrupt could have been received when the processor was in user mode. In that case, both the ES and DS 
segment selectors would reflect the user process, not the kernel. We need worry only about the ES and DS 
registers because the kernel will not use FS and GS. The instruction selector and stack selector are handled 
automatically by the interrupt mechanism of the 386 and are already stacked. 

We then OR in our interrupt vector’s group mask to disable any companion-device interrupts (as well as 
the device). We must save the old mask value on the stack, so that it is put back the way we found it. 
Finally, we stack the unit number of this interrupt vector and reenable interrupts to allow unmasked 
devices to interrupt while we are processing this interrupt (nesting). We can now call a C device 
interrupt-handler function. 



At this point, an interrupt frame is present on the stack, the kernel’s segment registers are set to be 
consistent with what the kernel’s program expects, all competing interrupts are blocked and unrelated 
interrupts are unblocked and active. The interrupt vector stub (see Listing Two ) then calls the C handler 
(in our earlier example wdintr) to process the device’s request and return. At this point, all device interrupt 
routines are effectively called with: 

wdintr(frame) struct intr_frame frame; 
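
The frame the handler sees can be worked out from the order of the pushes in the INTR() macro of Listing One. The structure below is a reconstruction for illustration only; the member names are ours, not necessarily those of the kernel's struct intr_frame. Because the stack grows downward, the last item pushed comes first.

```c
#include <stddef.h>
#include <stdint.h>

struct sim_intr_frame {
    uint32_t unit;      /* pushl $unit: pushed last, so lowest address */
    uint32_t oldmask;   /* pushl %edx: ICU mask in force before entry */
    uint32_t es;        /* pushl %es */
    uint32_t ds;        /* pushl %ds */
    uint32_t edi;       /* pushal stores eax first and edi last ... */
    uint32_t esi;
    uint32_t ebp;
    uint32_t esp_saved; /* ... including the pre-push stack pointer */
    uint32_t ebx;
    uint32_t edx;
    uint32_t ecx;
    uint32_t eax;
    uint32_t trapno;    /* pushl $T_ASTFLT */
    uint32_t err;       /* pushl $0 */
    uint32_t eip;       /* the remainder is stacked by the processor */
    uint32_t cs;
    uint32_t eflags;
    /* when the interrupt arrives in user mode, esp and ss follow here */
};
```

The first two members are the extra words an interrupt frame carries on top of the trap-style frame that begins at es.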

Getting Out of Interrupts 

You’d think that leaving an interrupt would be as easy or easier than getting in, but this is not the case. We 
still have some housekeeping chores to do on the way out that may be unrelated to the interrupt we were 
servicing. This clouds the picture a little, as seen in Listing Three (page 108). This housekeeping is 
basically a performance enhancement that moves the processing of certain kernel services running at 
interrupt level (and blocking the processor from other interrupts) to run just ahead of the user process. 

In anticipation of these chores, we’ve given both interrupts and traps similar stack frames. By popping off 
two words, we are back into a trap frame and ready to look for windows to wash. We restore the previous 
mask into the ICUs and check whether we are returning to a point where interrupts were masked before (a 
nested interrupt or critical section, for example). If so, we branch to do a simple return by unwinding our 
trap frame. We can do this because we are certain that the code we are returning to will eventually return 
to an unmasked level, and then the chores will be done. 

We first check whether any network protocol software interrupts must be processed. Incoming packets 
received by drivers are enqueued at the device-interrupt level, and a software interrupt is requested. On 
some machines (DEC VAX, for instance), the software-interrupt mechanism is implemented in hardware. 
(One sets a bit on a special register, which causes an interrupt when the processor returns to an unmasked 
level.) 

For the 386, we emulate this feature in software. If we find a request, we lock out other competing interrupt 
returns by setting the PROCESSING bit, and then successively call all protocol input routines that have been 
requested, clearing each request as we go. 
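
A user-level sketch of this emulation follows. It assumes a single word holding one request bit per protocol plus a lock-out bit; the bit positions, the names (sim_netisr, SIM_PROCESSING, and so on), and the counters standing in for the protocol input routines are all made up for the example.

```c
#include <stdint.h>

#define SIM_NPROTO      4
#define SIM_PROCESSING  0x80000000u   /* lock-out bit (hypothetical encoding) */

static uint32_t sim_netisr;             /* request bits plus the lock-out bit */
static int      sim_calls[SIM_NPROTO];  /* times each input routine "ran" */

/* Driver side: request a software interrupt for one protocol. */
static void sim_schednetisr(unsigned proto)
{
    sim_netisr |= 1u << proto;
}

/* Interrupt-exit side: run the requested routines, at most one pass active. */
static void sim_run_netisr(void)
{
    if (sim_netisr & SIM_PROCESSING)
        return;                         /* another exit path is already here */
    if (sim_netisr == 0)
        return;                         /* nothing requested */

    sim_netisr |= SIM_PROCESSING;
    for (unsigned p = 0; p < SIM_NPROTO; p++) {
        if (sim_netisr & (1u << p)) {
            sim_netisr &= ~(1u << p);   /* clear the request first ... */
            sim_calls[p]++;             /* ... then service it */
        }
    }
    sim_netisr &= ~SIM_PROCESSING;
}
```

Clearing each request before servicing it means that a packet arriving during the service routine sets the bit again rather than being lost.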

In a similar fashion, we call the software interrupt associated with the rescheduling clock. Earlier versions 
of UNIX did most scheduling calculations and watchdog routines at high levels. (Even worse, because 
they didn’t use queues of processes, linear searches of all processes were done, and this was expensive—cf. 
Version 6.) This method did not enhance UNIX’s reputation in the real-time department. One significant 
change in the BSD kernel was to move as much of the interrupt-level processing out to interruptible points 
and minimize the impact of overall system overhead. Clock processing turns out to account for much 
overhead in this regard (so much so, that 4.2BSD’s 100-Hz clocking was cut back to 60 Hz), so this 
change was well justified. 

Our final chore is to check whether we are returning to a user process, as opposed to the kernel. The 
contents of the code selector are checked to see if the selector will be executing user or kernel code. We’re 
a bit sloppy here, because we must check only the 2 low-order bits of the selector to find out if it belongs 
to the USER ring (3). This suffices for now, but we may have to revisit this code when 386BSD is 
enhanced to support multisegmented programming. 
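
The test itself is one mask and one compare. The constants and the function name below are illustrative; only the rule (the low 2 bits of the saved code selector give the ring) comes from the text.

```c
#define SIM_SEL_RPL_MASK 3u   /* low two bits of a selector hold the ring */
#define SIM_SEL_UPL      3u   /* ring 3: user */

/* Given the code selector saved in the frame, are we returning to user mode? */
static int sim_returning_to_user(unsigned cs)
{
    return (cs & SIM_SEL_RPL_MASK) == SIM_SEL_UPL;
}
```

A selector such as 0x08 yields 0 (kernel), while 0x1b or 0x17 yields 1 (user), whichever descriptor table the selector points into.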



If we are headed back to a user program, we must quickly check to see if the kernel has marked this 
process to be rescheduled. If so, we take the rescheduling trap we prepared for at the start of the INTR() 
macro. As mentioned in the 386BSD articles on multiprogramming (see DDJ, September and October 
1991), we have few opportunities to allow rescheduling. We could induce a case in which we’re safe from 
deadlocks by rescheduling when we are returning to the user program. 

UNIX as Interface--Is it Adequate? 

Although every edition of UNIX has contained innovative new areas of design, many device interfaces 
have not been altered to take advantage of this new technology. As such, the mechanism for 
communicating with the device driver is virtually identical from micro- to supercomputer, from first to last 
edition. 

Few people are willing to take on the daunting (or insane) challenge of breaking and fixing every single 
driver created! Because the best drivers already exploit the current interface to great advantage, changing 
the interface may seem a bigger problem than it is worth. 

The rewards coupled with this challenge are less obvious, so a careful look at emerging technology 
considerations is in order. A problem this big must be carefully outlined: 

Multithreaded I/O Devices. A near-term nuisance, commonly noticed with the use of disk arrays in 
particular, is the difficulty in adapting the characteristics of multithreaded (that is, more than one 
concurrent stream of I/O operations) devices to the flat, strictly synchronous I/O model implied by the 
current UNIX device-driver model. This does not suggest an arbitrary jump to an asynchronous model. 
History has shown the simplicity and elegance of the synchronous model to be valuable allies in the battle 
against operating-system kernel "bloat." However, here we need coordinated, synchronous scheduling of 
I/O that can be used by the higher levels in the operating system to arrange the "time domain" of 
multithreaded operation to best advantage. 

Streamer Devices. Almost every UNIX system has a magical kludge to support streaming devices such as 
cartridge tape, DAT, or 8MM video tape. These devices require extensive buffering to maintain 
"streaming operation"—in which the device can synchronously accept data to match its physical transfer 
"commitment". If this commitment is not met, the device must stop, rewind, get up to speed again, and 
re-do the transfer when the data is finally present. This causes the tape drive to "wheeze" back and forth 
and the data transfer to slow to a fraction of its top data rate; this turns system backups into interminable 
waits. Worse yet, the nature of these devices is that you only learn of a tape write error long after the 
driver has reported the block "written." This is because our UNIX device-driver interface implies a 1:1 
relationship between "raw" transfers and their error status, which means that only one can be outstanding 
at a given time. Therefore, most of the kludges break the rules to attempt to keep the device 
double-buffered, all to varying degrees of success. The amount of RAM and disk storage is going up 
radically with time (some say soon to about 1 gigabyte per user), so high-speed backup is not just 
desirable; it’s necessary for survival! We need a class of device with so-called "tear-away" I/O, where a 
portion of a process’s address space is passed off to the device driver and the backup process. This way, a 
queue of regions of memory can be operated on by the driver in strict, synchronous order, with the depth 
of the queue appropriate to the demands of the device. 



Redundancy of Common Code. Even with the previously mentioned efforts to collect driver common code 
into what is effectively a support library of routines, many of the system’s drivers still have considerable 
"common code" left inside. You begin to wonder if this method is adequate. An alternative approach 
would be to express the routines as methods of a driver "object" in an object-oriented language, such as 
C++. Thus, the commonality could be dealt with by having classes of drivers, and the state of an active 
driver/device could be expressed as an instance variable created on demand. In other words, it may be time 
to apply object-oriented languages to the kernel itself. 

Information Caching. The driver interface also suffers in terms of cached information possibly shared by 
other processors (say, in a multiprocessor or a dual-ported disk or via a network). To maintain cache 
consistency, invalidation mechanisms are necessary, perhaps even at the driver level. Also, cache 
mechanisms above the driver level need additional information about the status of a drive’s queued 
transfers to determine if the drive is already saturated with requests. If so, the cache should decline 
requests on that drive until it finishes that to which it has already committed. Drivers should be viewed as 
presenting an employable set of capabilities that the file-system cache can exploit. 

The device-driver interface should provide a real-time schedule for I/O to conclude in a deterministic 
period, and a mechanism for ordered writes in a strictly ordered fashion. (With almost every UNIX system 
today, one has to "sync" multiple times to ensure that sandbagged disk I/O makes it out before shutting 
down the system.) The file-system cache is also handicapped on writes because it must keep the disk 
consistent—so a single write of an additional block of data in a file causes as many as three or four separate 
writes to update related data structures on the disk to keep them consistent. By adding these missing 
mechanisms, the system’s cache could maintain stability without losing performance. 

Device Conflict. Because autoconfiguration is usually only done at bootload time, devices that conflict 
(and are hence "lost") cannot be used, and devices cannot be added after the system is up and running. 

This was acceptable when BSD UNIX ran on VAX mainframes for months at a time between 
reconfigurations, but it’s irritating to constantly reboot a PC if, say, a device was unplugged or turned off 
when the system last booted. Configuration and resource allocation (interrupts, DMA channels) need to be 
done uniformly, at will, and in an automatic fashion. 

File System to Device Disparity. The semantics of many device drivers fit the "file" metaphor, but UNIX 
drivers don’t have the complete semantics of files at all. Although this is no great loss in many cases (for 
example, unit record devices, such as serial ports and printers), for others (notably mass storage devices), 
file attributes such as size, write-protection, modes, and even device-naming conventions visible to the 
programmer are not visible to the device driver, causing a loss of potential functionality. We can repair 
this by developing a special device (spec) file system, where device drivers are attached. In this case, the 
primitive operators of the Virtual Filesystem of the BSD kernel correspond to all of the user-level process 
file operations to the device driver itself. (For example, we export the virtual file abstraction down to the 
device driver.) This is currently implemented in BSD. At the last moment, however, it converts to the 
ancient UNIX device-driver conventions to avoid dealing with the problem (for now). (If this sounds like 
shades of "Plan-9," who are we to say differently?...) 

MACH devotees have decreed that device drivers in user processes are a must, in order to allow them to 
be dynamically loaded. But by choosing our file abstraction appropriately, we can gain the facility for 
loadable drivers without all the smoke and mirrors. If access through to the drivers is uniform through the 
central concept of a virtual file system, then the interface can be "remoted," much as NFS provides for 
remote files. (This is just an analogy, not a suggested mechanism.) In short, the choice of a better 
device-driver interface can offer substantial advantages without necessarily requiring large amounts of 
radical code that would contribute to the bloat factor of the kernel. 

[LISTING ONE] 

/* [Excerpted from /sys/i386/isa/icu.h] */ 

/* Macros for interrupt level priority masks (used in assembly code) */ 

/* mask additional interrupts */ 
#define ORPL(m) \
        cli ;                           /* disable interrupts */ \
        movw m , %dx ;                  /* get the mask */ \
        inb $ IO_ICU1+1, %al ;          /* next, get low order mask */ \
        xchgb %dl, %al ;                /* switch the old with the new */ \
        orb %dl, %al ;                  /* finally, or it in! */ \
        outb %al, $ IO_ICU1+1 ; \
        inb $0x84, %al ; \
        inb $ IO_ICU2+1, %al ;          /* next, get high order mask */ \
        xchgb %dh, %al ;                /* switch the old with the new */ \
        orb %dh, %al ;                  /* finally, or it in! */ \
        outb %al, $ IO_ICU2+1 ; \
        inb $0x84, %al ;                /* flush write buffer, delay bus cycle */ \
        movzwl %dx, %eax ;              /* return old priority */ \
        sti ;                           /* enable interrupts */ 

/* force interrupt mask */ 

♦define SETPL(v) \ 

cli ; /* disable interrupts */ \ 

movw v , %dx ; \ 

inb $ I0_ICU1+1, %al ; /* next, get low order mask */ \ 

xchgb %dl, %al ; /* switch the old with the new */ \ 

outb %al, $ I0_ICU1+1 ; \ 
inb $0x84, %al ; \ 

inb $ IO_ICU2+l, %al ; /* next, get high order mask */ \ 

xchgb %dh, %al ; /* switch the copy with the new */ \ 

outb %al, $ IO_ICU2+l ; \ 

inb $0x84, %al ; /* flush write buffer, delay bus cycle */ \ 

movzwl %dx, %eax ; /* return old priority */ \ 

sti ; /* enable interrupts */ 

/* Mask a group of interrupts atomically - interrupt entry */ 

♦define INTR(unit,mask,offst) \ 

pushl $0 ; /* first, build a trap frame for ... */ \ 

pushl $ T_ASTFLT ; /* ... possible rescheduling that may occur */ \ 

pushal ; \ 
nop ; \ 

movb $0x20, %al ; /* next, as soon as possible send EOI ... */ \ 

outb %al, $ I0_ICU1 ; /* ...so in service bit may be cleared ...*/ \ 
inb $0x84, %al ; /* ... ASAP */ \ 

movb $0x20, %al ; /* likewise, the other one as well */ \ 

outb %al,$ IO_ICU2 ; \ 
inb $0x84,%al ; \ 

pushl %ds ; /* save our data and extra segments ... */ \ 

pushl %es ; \ 

movw $0x10, %ax ; /* ... and reload with kernel's own */ \ 

movw %ax, %ds ; \ 

movw %ax, %es ; \ 


198 



incl _cnt+V_INTR ; /* tally interrupts */ \ 

incl _isa_intr + offst * 4 ; \ 

movw mask , %dx ; /* assert group mask */ \ 

inb $ I0_ICU1+1, %al ; /* next, get low order mask */ \ 
xchgb %dl, %al ; /* switch the old with the new */ \ 

orb %dl, %al ; /* finally, or it in! */ \ 

outb %al, $ I0_ICU1+1 ; \ 
inb $0x84,%al ; \ 

inb $ I0_ICU2+1, %al ; /* next, get high order mask */ \ 

xchgb %dh, %al ; /* switch the old with the new */ \ 

orb %dh, %al ; /* finally, or it in! */ \ 

outb %al, $ IO_ICU2+l ; \ 
inb $0x84, %al ; \ 

pushl %edx ; /* save old mask for when we return */ \ 

pushl $ unit ; /* finish off interrupt frame with unit # */ \ 

sti ; /* and allow other unmasked interrupts */ 


[LISTING TWO] 

/* AT/386 - Interrupt vector routines - Generated by config program */

#include "machine/isa/isa.h"
#include "machine/isa/icu.h"

#define VEC(name) .align 4; .globl _V/**/name; _V/**/name:

    .globl _hardclock
VEC(clk)
    INTR(0, _highmask , 0)
    call _hardclock
    INTREXIT1

    .globl _wdintr, _wd0mask
    .data
_wd0mask: .long 0
    .text
VEC(wd0)
    INTR(0, _biomask , 1)
    call _wdintr
    INTREXIT2

    .globl _fdintr, _fd0mask
    .data
_fd0mask: .long 0
    .text
VEC(fd0)
    INTR(0, _biomask , 2)
    call _fdintr
    INTREXIT1

    .globl _cnrint, _cn0mask
    .data
_cn0mask: .long 0
    .text
VEC(cn0)
    INTR(0, _ttymask , 3)
    call _cnrint
    INTREXIT1

    .globl _npxintr, _npx0mask
    .data
_npx0mask: .long 0
    .text
VEC(npx0)
    INTR(0, _npx0mask, 4)
    call _npxintr
    INTREXIT2

    .globl _comintr, _com0mask
    .data
_com0mask: .long 0
    .text
VEC(com0)
    INTR(0, _ttymask , 5)
    call _comintr
    INTREXIT1

    .globl _weintr, _we0mask
    .data
_we0mask: .long 0
    .text
VEC(we0)
    INTR(0, _netmask , 6)
    call _weintr
    INTREXIT1

    .globl _lptintr, _lpt0mask
    .data
_lpt0mask: .long 0
    .text
VEC(lpt0)
    INTR(0, _ttymask , 7)
    call _lptintr
    INTREXIT1

[LISTING THREE] 


/* [Excerpted from /sys/i386/isa/icu.h] */

/* First eight interrupts (ICU1) */
#define INTREXIT1 \
    jmp doreti

/* Second eight interrupts (ICU2) */
#define INTREXIT2 \
    jmp doreti

/* [Excerpted from /sys/i386/isa/icu.s] */

/* Handle return from interrupt after device handler finishes */
doreti:
    /* move to a trap frame */
    cli                         /* interrupts off while we work ... */
    popl %ebx                   /* remove unit number */
    popl %eax                   /* get previous priority mask */

    /* restore previous mask */
    movw %ax, %cx
    outb %al, $ IO_ICU1+1
    inb $0x84, %al
    movb %ah, %al
    outb %al, $ IO_ICU2+1
    inb $0x84, %al

    /* are we at interrupt level / nested interrupt already ? */
    cmpw _nonemask_, %cx
    jne 3f

    /* do we need to process a network software interrupt ? */
    cmpl $0, _netisr
    je 2f
    btsl $ NETISR_PROCESS, _netisr
    jb 2f

#include "../net/netisr.h"

#define DOCALL(n, s, c) ; \
    .globl c ; \
    btrl $ s , n ; \
    jnb 1f ; \
    call c ; \
1:

    /* process a network software interrupt */
    sti
    DOCALL(_netisr, NETISR_RAW, _rawintr)
#ifdef INET
    DOCALL(_netisr, NETISR_IP, _ipintr)
#endif
#ifdef IMP
    DOCALL(_netisr, NETISR_IMP, _impintr)
#endif
#ifdef NS
    DOCALL(_netisr, NETISR_NS, _nsintr)
#endif
    btrl $ NETISR_PROCESS, _netisr
2:
    /* do we need to process a software clock "interrupt" */
    cli
    btrl $ SCLK_NEED, _softclock
    jnb 1f
    btsl $ SCLK_PROCESS, _softclock
    jb 1f

    /* process a software clock "interrupt" */
    sti
    pushl _nonemask_            /* to an interrupt frame again */
    pushl %ebx
    call _softclock
    popl %eax                   /* back to trap frame for possible AST */
    popl %eax
    btrl $ SCLK_PROCESS, _softclock_
1:
    /* see if we need to process an AST (rescheduling) fault */
    cmpw $0x1f, tCS*4(%esp)     /* were we executing from a user mode ... */
    jne 3f                      /* ... code selector? */
    DOCALL(_ast_, AST_NEED, _trap);
3:
    /* restore the state and return */
    popl %es
    popl %ds
    popal
    nop
    addl $8, %esp
    iret


Porting Unix To The 386: Missing Pieces, Part 1 

Finishing the NET/2 release of Berkeley UNIX to obtain a complete, unencumbered system for the 386 
PC. Describes the methodology and implementation of the remaining facilities necessary to generate a 
working operating system for the PC. 

Completing the 386BSD kernel 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, the first virtual-memory, microprocessor-based UNIX system. Prior to 
establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric 
Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or 
comments to ljolitz@cardio.ucsf.edu. (c) 1992 TeleMuse. 

When we began the 386BSD project in 1989, 386BSD was simply intended to be a port of BSD to the 
386. Our purpose in doing 386BSD was so students, faculty, staff, and researchers could use BSD on a 
simple and inexpensive platform. While we did not wish to add to anyone’s proprietary license revenues 
by folding in new encumbered code (especially code pertaining to the 386), removing old encumbered code 
or designing new code to replace it was outside the scope of this project. And since we were the only ones 
willing to work gratis on 386BSD, making an unencumbered version was impossible. After we 
contributed 386BSD to the University of California at Berkeley (UCB) in December 1990, the UCB staff 
seriously began to set their sights on releasing only unencumbered code. As you might expect, it was quite 
a chore to continually update 386BSD and revise it, matching the work done by UCB staff. The result was 
the University of California Berkeley Networking Software, Release 2 ("NET/2"). 

In a break with past installments, we’ve taken the unencumbered but incomplete NET/2 kernel and 
finished the missing pieces necessary to make a bootable running kernel that provides a self-supporting 
development environment. In order to make the complete 386BSD kernel, we had to complete some code 
that was not available when UCB composed the NET/2 tape. Some of these areas, such as execve(), were 
simply not available at the time, while others (clists, resource maps, buffer cache) were based on obsolete 
portions of the system due to be replaced by more modern facilities (such as streams and the page cache). 
In fact, we have found that many of these "new" facilities are still quite far down the road. 


We needed to create these "missing pieces," so we used the NET/2 system itself to evolve the eventual 
replacements for these same pieces. To replace the missing clists, we chose to design ring buffers to 
function in their stead. In place of the missing resource maps, we invented a more flexible mechanism 
called a resource list that exploits the dynamic memory-allocation mechanism already present in the 
NET/2 kernel. For the buffer-based I/O mechanism and program-execution system call, we relied on 
reference materials to design a suitable substitute to serve us while we slowly work on our own, newer 
version of a page cache which utilizes a completely different approach. 

While we still intend to emphasize innovative work, we are constrained within the confines of the present. 
Thus, we have chosen the shortest path to completion of the 386BSD goal, by implementing facilities that 
will dock to the rest of the NET/2 kernel source with minimal change. In fact, the source code presented 
over the next two installments—plus a small set of bug fixes and a recent copy of the NET/2 tape (available 
via M&T Online)—will enable you to build an operational kernel. Before you go running for the phone, 
please realize that besides a kernel, you need bootstraps (DDJ, February 1991), binaries of the utilities 
(DDJ, March, April, and May 1991), a root file system (DDJ, July, August, September, and October 1991), an 
installation mechanism (DDJ, February and May 1991 as well as March and April 1992), and 
documentation (DDJ, January through November 1991, and February 1992 onward) before making 
386BSD a real bootable system. 

Readers who have followed this series now have enough material and information to put in the elbow 
grease and finish it on their own. For those with less patience, we hope to follow this up shortly with a real 
386BSD binary that can be put on a PC without major travail. 

386BSD Kernel Completion Methodology 

The methodology we followed to complete the 386BSD NET/2 kernel was critical to our success. We 
began with an examination of the documentation that described how each of these missing facilities 
managed to function, and then reviewed the interface structure within the NET/2 kernel. From this 
information, we created a model of semantics for each of the missing facilities. Among the references we 
found useful were Maurice Bach’s The Design of the UNIX Operating System, Tanenbaum's MINIX 
series, the BSD book, the UNIX System V Programmers Reference Manual, Knuth’s The Art of Computer 
Programming, as well as selected readings from USENIX Technical Proceedings over the years. 
Comparing these references is interesting because they all contain different perspectives which "color" the 
puzzle piece slightly differently. Using these reference materials, we wrote the first versions of these 
modules from scratch. Then, by examining functional entry points that called the prototype version, we 
discovered weaknesses in the original assumptions—some of which required significant revision. This 
approach also allowed us to construct a realistic test program around the intended kernel code that 
facilitated developing and debugging code in a user-process "framework." 

Although modern kernel debuggers and other development tools are now ubiquitous, isolated debugging is 
still valuable because it allows you to circumscribe the problem more efficiently. For example, a number 
of bugs in the NET/2 kernel which had been masked by the older missing code were discovered by this 
process. The independent validation that isolated, user-mode development provides is a valuable tool for 
pinpointing problem areas in the NET/2 code as well as in the new software. 


Once a rough version was made to work in a user framework, the complex cases were tested to localize 
implementation problems. It’s said that 90 percent of your bugs come from less than 1 percent of your 
source code. However, you can usually guess where the trouble will strike, and that’s exactly where you 
target your test vectors. Unfortunately, many programmers shy away from this procedure, preferring to 
visually "inspect" the code instead of doing the "acid test." For example, all the bugs in the ring-buffer 
code were located in the boundary-crossing cases on normal and inverse functions. There were analogous 
cases with the contiguous GET/PUT operations, so these too needed to be examined at the boundaries. In 
one case, we made the ring buffers absurdly short to exacerbate the problems on the boundaries, and in 
doing so, we exposed other inadequacies as well. 
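
The boundary-hunting approach described above can be sketched in a user-mode framework. What follows is a minimal illustration, not the 386BSD ring-buffer code (that arrives in a later installment): the names ring, ring_put(), ring_get(), ring_selftest(), and the absurdly short RING_SIZE are all assumptions made for the example. The test fills and drains the ring repeatedly so that the read and write positions cross the end of the array at every possible offset.

```c
#include <assert.h>

#define RING_SIZE 4             /* absurdly short on purpose: wraps every few operations */

struct ring {
    unsigned char buf[RING_SIZE];
    unsigned head;              /* next slot to write */
    unsigned tail;              /* next slot to read */
};

/* Store one character; returns 0 on success, -1 if the ring is full.
 * One slot is always left empty so that "full" and "empty" differ. */
static int ring_put(struct ring *r, unsigned char c)
{
    unsigned next = (r->head + 1) % RING_SIZE;

    if (next == r->tail)
        return -1;
    r->buf[r->head] = c;
    r->head = next;
    return 0;
}

/* Fetch the oldest character, or -1 if the ring is empty. */
static int ring_get(struct ring *r)
{
    int c;

    if (r->tail == r->head)
        return -1;
    c = r->buf[r->tail];
    r->tail = (r->tail + 1) % RING_SIZE;
    return c;
}

/* Fill and drain repeatedly so head and tail wrap at every offset.
 * Returns the total number of characters moved through the ring. */
static int ring_selftest(void)
{
    struct ring r = { {0}, 0, 0 };
    int round, moved = 0;

    for (round = 0; round < 100; round++) {
        int n = 0, j, c;

        while (ring_put(&r, (unsigned char)(round + n)) == 0)
            n++;
        assert(n == RING_SIZE - 1);         /* capacity is size minus one */
        for (j = 0; j < n; j++) {
            c = ring_get(&r);
            assert(c == (unsigned char)(round + j));   /* FIFO order kept */
        }
        c = ring_get(&r);
        assert(c == -1);                    /* must be empty again */
        moved += n;
    }
    return moved;
}
```

Because three characters fit per round and the array holds four slots, each round starts one slot earlier than the last, so every wrap position is exercised within four rounds.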

Next, we had to contend with the environmental and interface demands of the kernel. No user program 
framework can hope to simulate the interrupt-driven, context-switched, race-prone environment of a basic 
friendly kernel (with the result that the system wedges a lot). Unlike Mach, which hopes to export this 
environment to user processes under the guise of kernelizing the system (the system still wedges, but the 
problems occur in the user environment, making it more difficult to localize the problem), we prefer that 
the kernel environment not be a part of the isolated test framework. When we step into the kernel, it’s an 
all-or-nothing proposition, because the mechanisms interact greatly. This complexity of interaction is 
always present, whether it stays in the kernel or gets exported to a user process. 

The methods through which we regulate the introduction of new code into the kernel are the key to 
ensuring that code’s proper operation. We leverage our understanding of the entry points in order to 
examine and track the actual requests passed to our new code and compare them against what our 
user-mode test model does. 

In a sense, we end up turning our debugging process inside out. Instead of producing cases to perturb the 
inner workings of the new code, we debug the interfaces and the outer procedures that call it. We do this 
methodically, testing the boundary cases of each as we go and looking for unexpected assumptions about 
the semantics of the new code. 

In the case of the buffer cache, this method caught significant "gotchas." Much of the semantics of the 
NET/2 buffer cache are implied by the surrounding code. (For example, the rescaling of buffer size, buffer 
invalidation, and forcing to back store are all intertwined with the virtual file system and subsidiary 
file-system layers.) This could, in part, explain why it has appeared so difficult to replace the obsolete 
buffer cache. Its semantics are spread all over the map! 

With our methodology in place, we began to finish off the missing pieces, arriving, ultimately, at 386BSD 
Unbound. 

386BSD Resource Lists 

One part missing from the NET/2 386BSD kernel is the facility for dense storage, or region allocation, 
known in Berkeley UNIX vernacular as the resource maps. Resource maps were created in 4BSD as a 
generalization of the "core click" physical-memory allocator found in the original Version 6 Bell 
Laboratories UNIX for the PDP-11. They were widely used in the older 4BSD virtual-memory system. 
However, in the NET/2 kernel’s virtual-memory system, they are only used to allocate contiguous hunks 
of swap space to contain swapped-out processes. (Incidentally, the term "map" here is misleading, as this 
has nothing to do with the virtual-memory system’s use of the term "map" to describe using the 
processor’s address-translation hardware.) 


Resource maps work by describing allocatable segments as a two-tuple (index, size). These two-tuples are 
stored in a contiguous array of fixed size. As allocations are made from the map, the segments fragment 
and take up increasing space in the array. When fragments are logically returned to the map, the free() 
procedure glues the fragments back together and attempts to shrink space in the array. If the array is large 
enough, the worst possible fragmentation cannot exceed the size of the resource map. 

Resource maps, while elegant, compact, and quick for PDP-11 memory allocation, have some annoying 
drawbacks. To make a fast implementation, the "0th" index is not usable, because it is indistinguishable 
from "nothing" on the list (in other words, it’s used as a sentinel), and the entry that corresponds to it is 
used to hold the upper-bounds limit and name of the given resource map instance. As a result, the caller 
either needs to relocate the range above 0 before handing it to the resource map routines or discard the first 
allocation unit (the "0th" index). In addition, the size and extent of the resource map are fixed at 
initialization time and are unalterable, so storage for the map must be reserved for the "worst-case size." If 
you guess wrong at worst case, you're screwed! 
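
To make the sentinel problem concrete, here is a minimal sketch of first-fit allocation from a fixed array of (size, address) entries, written in the style of the early UNIX map allocator. It is not the 4BSD resource map source: the names struct map and map_alloc() are our own for illustration, and the 4BSD convention of keeping the limit and name in the first entry is left out.

```c
#include <assert.h>
#include <stddef.h>

/* One free segment: m_size units starting at m_addr.
 * An entry with m_size == 0 marks the end of the map. */
struct map {
    unsigned m_size;
    unsigned m_addr;
};

/* First-fit allocation. Returns the start of the allocated region,
 * or 0 if nothing fits; since 0 means "failure", address 0 itself
 * can never be handed out by this interface. */
static unsigned map_alloc(struct map *mp, unsigned size)
{
    struct map *bp;
    unsigned a;

    assert(mp != NULL);
    for (bp = mp; bp->m_size; bp++) {
        if (bp->m_size >= size) {
            a = bp->m_addr;
            bp->m_addr += size;
            if ((bp->m_size -= size) == 0) {
                /* entry used up: slide the remaining entries down one slot */
                do {
                    bp++;
                    (bp - 1)->m_addr = bp->m_addr;
                } while (((bp - 1)->m_size = bp->m_size) != 0);
            }
            return a;
        }
    }
    return 0;
}
```

Note that the array itself never grows here; it is the matching free routine, which must split and insert entries, that can run off the end of a map sized too small.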

In many early UNIX systems, some kernels never bounds-checked these arrays at all, and would merrily 
scribble all over the next adjacent memory locations after the map! The system would then run for a 
period of time afterward, and, when the inevitable crash occurred, it appeared to come from an unrelated 
portion of the system. Of course, by that time, the map had become less fragmented, and its contents 
showed no irregularity; in other words, an almost "self-healing" bug. Those who had discovered this 
clever trick protected the map with a "bounds check" which caused a system panic to occur if the map 
fragmented outside the array. After a while, it became tedious to recompile the kernel with greater and 
greater map sizes, so instead of panicking, the resource map allocator would just "drop" the fragment. This 
approach had humorous side effects on large time-sharing systems, when the fragment dropped turned out 
to be something important, such as half of all available swap space, because large fragments tend to collect 
on the end of the list. 


This static mechanism has been tolerated for so long partly because it’s been used as the bottom-level 
storage allocator, and allowing it to use dynamic allocation was considered unwise. (For example, what if 
it fragmented when low on memory?) 

Many design decisions in the kernel change when dynamic memory allocation becomes available, so we 
replaced the fixed allocation resource maps with resource lists built out of a pay-as-you-go dynamic 
memory-allocation scheme. 


Resource Lists Defined 


Resource lists (see Listing One) are arbitrary-length lists, each element of which describes a segment as 
an inclusive [start, end] two-tuple. The list elements are kept in a sorted order, with fragmenting entries 
causing spontaneous new entries to be allocated (via malloc). As segments are freed, allowing holes to be 
filled and fragments to be reassembled, adjacent entries are reduced to single ones, and the now 
superfluous list entries are freed and returned to the dynamic memory allocator. Thus, no loss of storage 
need occur. The price we pay is the added cost of dynamic allocation, which is generally small compared 
to the number of times our resource lists are used. 


Other advantages of resource lists stem from their dynamic nature. There is no initialize function, only 
allocate and free entry calls, because initialization is just passing free space to an otherwise empty list, in 
any order and at any time. This allows us, for example, to add additional swap space without having to 
reinitialize the resource map or reserve space ahead of time. Also, because the full dynamic range is 
present, segments starting and ending at any point in a 32-bit number’s range can be used. 

rlist_alloc(). The resource list allocate function (see Listing Two) traipses down a linked list looking for a 
large enough region from which to allocate. In doing this, it uses a single doubly indirect pointer to check 
for an unallocated entry (a null rlist pointer) as well as hold a pointer to the forward link (in case we need 
to restructure the list). When we find an entry of the appropriate size, we optionally pass back its 
location and reduce the size of the resource-list entry. (The caller might not want it, but this ensures it 
won’t be allocated by others.) If we reduce the size to the point that the element is empty, we free the 
list-entry space and rewire the previous list’s pointer to the succeeding entry (if one exists). 

rlist_free(). The resource list free function (see Listing Two) is beefier, as it must glue the fragments back 
together and attempt to simplify them (that is, represent the fewest list entries). This routine is actually 
attempting to reverse the damage from the numerous unordered allocations and frees prior to being called. 
Like rlist_alloc(), it walks the list with a doubly indirect pointer, but it searches for a list element it can 
merge into or a point in the sorted list where it can be inserted. If a merge occurs, this function scans the 
entire list, trying to reduce adjacent entries that can be merged as the result of a hole being filled. 
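
For readers who want to experiment outside the kernel, the behavior just described can be approximated in a user-mode sketch. This is not Listing Two: the names struct xrlist, xr_free(), and xr_alloc() are hypothetical, the standard C library malloc() and free() stand in for the kernel allocator, and errors are reported by return value rather than by panic. Callers are assumed to pass start <= end.

```c
#include <assert.h>
#include <stdlib.h>

struct xrlist {                 /* hypothetical stand-in for struct rlist */
    unsigned start, end;        /* inclusive bounds of a free extent */
    struct xrlist *next;
};

/* Return [start, end] to the sorted list, merging with neighbors.
 * Returns 0 on success, -1 on overlap or allocation failure. */
static int xr_free(struct xrlist **lp, unsigned start, unsigned end)
{
    struct xrlist *cur, *n;

    assert(start <= end);
    /* walk past entries that end before the new region begins */
    while ((cur = *lp) != NULL && cur->end < start) {
        if (cur->end + 1 == start) {            /* adjacent at tail: extend */
            n = cur->next;
            if (n && n->start <= end)
                return -1;                      /* overlaps the next entry */
            cur->end = end;
            if (n && end + 1 == n->start) {     /* hole filled: merge with next */
                cur->end = n->end;
                cur->next = n->next;
                free(n);
            }
            return 0;
        }
        lp = &cur->next;
    }
    if (cur && end >= cur->start)
        return -1;                              /* overlap: freed twice? */
    if (cur && end + 1 == cur->start) {         /* adjacent in front: extend */
        cur->start = start;
        return 0;
    }
    n = malloc(sizeof *n);                      /* otherwise insert a new entry */
    if (n == NULL)
        return -1;
    n->start = start;
    n->end = end;
    n->next = cur;
    *lp = n;
    return 0;
}

/* First fit: take size units from the front of the first entry big enough.
 * Returns 1 and sets *loc (if loc is non-null), or 0 if nothing fits. */
static int xr_alloc(struct xrlist **lp, unsigned size, unsigned *loc)
{
    struct xrlist *cur;

    if (size == 0)
        return 0;
    for (; (cur = *lp) != NULL; lp = &cur->next) {
        if (size - 1 <= cur->end - cur->start) {
            if (loc)
                *loc = cur->start;
            if (size - 1 == cur->end - cur->start) {
                *lp = cur->next;                /* entry used up: unlink it */
                free(cur);
            } else
                cur->start += size;
            return 1;
        }
    }
    return 0;
}
```

As in the kernel version, there is no initialize call: an empty list is a null pointer, and space is simply handed to it with the free routine in any order.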

Program Execution Function 

Another missing piece from the NET/2 kernel involves execution of a program from a file, a critical part 
of any system. At some point, we must load a file of binary 386 instructions and execute them. In many 
UNIX implementations, this is one of the most complicated system calls in the entire kernel. 

Executable File Format 

The executable file formats to choose from on the 386 include COFF, ELF, ROSE, X.OUT, A.OUT, and 
variants. All have different advantages and adherents, and even Intel’s BCS2 (Binary Compatibility 
Standard) does not give a single conclusive choice for an executable format. Unlike MS-DOS, where an 
.EXE file is the same wherever you go, UNIX compatibility is less certain. 

To make a basic functional system, we implemented a single executable format. It needed to be one our 
GNU linkage editor already generated, so we chose the type 413 style of A.OUT, the original executable 
file format of UNIX, named for the octal value of the magic number leading the header in front of the 
executable file. (A.OUT is short for Assembler OUTput file. With no other arguments, a UNIX assembler 
drops its object file into "a.out" in the current directory. On the PDP 11, such files were directly 
executable by the system’s exec(), and would work if they did not have any undefined external references 
that the loader/link editor would need to satisfy from other object files or libraries.) 

This format was originally created in 3BSD for the VAX, specifically to allow use of paging. Its 
prevalence is due mainly to it being around for a long time and luckily for us, it is perpetuated in the GNU 
loader. The format consists of a "page-cluster" sized header, followed by pages of instruction space (to be 
marked for "read-only" access) and ending with pages of initialized data (to be marked for "read/write" 
access). 


A page cluster is a group of pages logically considered to be a single page. On a VAX, with a tiny page 
size of 512 bytes, clusters of 1 Kbyte reduced paging traffic at the expense of increased fragmentation 
loss. With the 386's 4-Kbyte pages, we chose a cluster size of 1, because the pages are of adequate size. 
Alignment isn’t a bad thing, especially at a sector (or block) granularity level, so a page fault can be 
satisfied with an integral number of contiguous disk transfers. However, this results in some wasted space. 
An alternative is to put the header either at the rear of the file (where it becomes a trailer) or include the 
header as part of the address space mapped by the file (the first words of instruction space). Type 407, the 
original UNIX executable, worked this way because octal 400 was the jump instruction, and an offset of 7 
caused it to jump over the eight-word header to the first instruction following the header! 


As on the VAX (see Listing Three), we assume that instructions are to be mapped to virtual address 0 on 
up to the end of text (a_text), where the data pages start and continue, including so-called "BSS pages" 
(uninitialized data or a_bss). When the system runs the program, it assigns (somewhere) and creates a 
stack segment, packs it with arguments, and enters the program in user mode at a location (a_entry) 
designated in the header. This type 413 is the predominant executable format on Berkeley UNIX systems, 
as it allows for sharing of the read-only pages and protection of user instructions from accidental or 
deliberate modification. 
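
The layout just described can be worked through as simple arithmetic. The sketch below is illustrative only and is not the execve() code: struct aout_hdr follows the traditional BSD a.out header fields (only a_text, a_bss, and a_entry are named in the text above), while struct aout_layout, aout_layout_413(), and the PGSIZE_386 constant are hypothetical names. It assumes a cluster size of 1, a full 4-Kbyte header page, and rounds the text size up to a page boundary in case the linker has not already done so.

```c
#include <assert.h>
#include <stddef.h>

#define ZMAGIC_413  0413u       /* type 413 magic number (octal) */
#define PGSIZE_386  4096u       /* 386 page size, cluster size of 1 assumed */

/* Traditional BSD a.out header: eight 32-bit words */
struct aout_hdr {
    unsigned a_magic;           /* magic number */
    unsigned a_text;            /* bytes of instructions */
    unsigned a_data;            /* bytes of initialized data */
    unsigned a_bss;             /* bytes of uninitialized data */
    unsigned a_syms;            /* bytes of symbol table */
    unsigned a_entry;           /* entry point (virtual address) */
    unsigned a_trsize;          /* bytes of text relocation */
    unsigned a_drsize;          /* bytes of data relocation */
};

struct aout_layout {
    unsigned text_off, data_off;                /* offsets within the file */
    unsigned text_va, data_va, bss_va, end_va;  /* virtual addresses */
};

/* Round a byte count up to the next page boundary. */
static unsigned page_round(unsigned n)
{
    return (n + PGSIZE_386 - 1) & ~(PGSIZE_386 - 1);
}

/* Compute where each piece lives; returns -1 if the file is not type 413. */
static int aout_layout_413(const struct aout_hdr *h, struct aout_layout *l)
{
    assert(h != NULL && l != NULL);
    if (h->a_magic != ZMAGIC_413)
        return -1;
    l->text_off = PGSIZE_386;                   /* header fills the first page */
    l->data_off = l->text_off + page_round(h->a_text);
    l->text_va  = 0;                            /* instructions start at address 0 */
    l->data_va  = page_round(h->a_text);        /* data follows text (read/write) */
    l->bss_va   = l->data_va + h->a_data;       /* BSS follows initialized data */
    l->end_va   = l->bss_va + h->a_bss;
    return 0;
}
```

A 5000-byte text segment thus occupies two pages in both the file and the address space, which makes visible the alignment cost the format trades for simple page-fault handling.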


For the purposes of basic operation, type 413 executable file format is sufficient. It’s far from ideal, 
however, because the 4-Kbyte header page on the front wastes space. Another rub is that location 0 is 
mapped with a valid page, and at times we would like to detect access to the uninitialized pointers that 
frequently appear as NULL pointers. Such pointers can conceivably point to large structures, so the actual 
illegal reference we want to trap might not occur at 0, but "near" it (perhaps within 64 Kbytes). Therefore, 
we might need to avoid putting anything at the bottom of the virtual address space, which conflicts with 
the way type 413 defines its instruction segment to work. On the 386 not only could a user program have 
NULL pointers to be caught, but the kernel could as well, because they share the same virtual address 
space. 
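
A small example shows why the trap region has to extend well past location 0. The structure and function names here (struct big_table, null_deref_address(), trapped_by_guard(), guard_demo()) are hypothetical; the point is only that a member reference through a NULL pointer lands at the member's offset, not at 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure whose last member sits far from offset 0. */
struct big_table {
    char entries[60000];
    int  in_use;
};

/* Address touched by reading p->in_use when p is a NULL (zero) pointer. */
static size_t null_deref_address(void)
{
    return offsetof(struct big_table, in_use);
}

/* Would an unmapped region of guard_bytes at the bottom of the
 * address space catch a reference to addr? */
static int trapped_by_guard(size_t addr, size_t guard_bytes)
{
    return addr < guard_bytes;
}

/* One unmapped page misses the reference; a 64-Kbyte region catches it. */
static void guard_demo(void)
{
    size_t addr = null_deref_address();

    assert(!trapped_by_guard(addr, 4096));
    assert(trapped_by_guard(addr, 65536));
}
```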

Additionally, our new kernel uses dynamic memory allocation, and a common problem with this occurs 
when an unintended reference is made to a (stale) pointer to a freed (and frequently reassigned) memory 
region. Such areas are often cleared (set to 0) before use! The upshot of this is that kernel NULL pointers 
are a common problem made more difficult to diagnose because 0 is mapped, thus masking the problem. 

Executable file format also impacts the development of a method for sharing commonly referenced code 
among files. "Shared-object libraries" reduce the amount of disk space taken up by executable files by 
making one physical copy available to many programs simultaneously. The copy is stored separately from 
all the interlinked executables. This can result in considerable space savings, especially by the 
multimegabyte X libraries and toolkits. Shared libraries (and potentially dynamic linking) also provide 
mechanisms to manage the ever-growing complexity of modern software, by exploiting it in an 
object-oriented fashion. 

Given these demands, it’s tempting to create yet another executable file format, but this must be 
considered carefully, as it could affect future editions of 386BSD. 


Next Month 


In next month’s installment, we’ll implement a "bare-bones" execve() system call that allows 386BSD to 
provide basic operation, a block I/O buffer cache used to reduce the cost of UNIX file operations, and ring 
buffers that reduce the cost of tty-character buffer management. 

DDJ 

[LISTING ONE] 

/* Copyright (c) 1992 William Jolitz. All rights reserved. 

* Written by William Jolitz 1/92 

* Redistribution and use in source and binary forms, with or without 

* modification, are permitted provided that the following conditions are met: 

* 1. Redistributions of source code must retain the above copyright notice, 

* this list of conditions and the following disclaimer. 

* 2. Redistributions in binary form must reproduce the above copyright notice, 

* this list of conditions and the following disclaimer in the documentation 

* and/or other materials provided with the distribution. 

* 3. All advertising materials mentioning features or use of this software 

* must display the following acknowledgement: This software is a component 

* of "386BSD" developed by William F. Jolitz, TeleMuse. 

* 4. Neither the name of the developer nor the name "386BSD" may be used to 

* endorse or promote products derived from this software without specific 

* prior written permission. 

* THIS SOFTWARE IS A COMPONENT OF 386BSD DEVELOPED BY WILLIAM F. JOLITZ AND 

* IS INTENDED FOR RESEARCH AND EDUCATIONAL PURPOSES ONLY. THIS SOFTWARE SHOULD 

* NOT BE CONSIDERED TO BE A COMMERCIAL PRODUCT. THE DEVELOPER URGES THAT USERS 

* WHO REQUIRE A COMMERCIAL PRODUCT NOT MAKE USE OF THIS WORK. 

* THIS SOFTWARE IS PROVIDED BY THE DEVELOPER "AS IS" AND ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 

* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 

* EVENT SHALL THE DEVELOPER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 

* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 

* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; 

* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 

* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR 

* OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 

* ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 

* Resource lists. Usage: 

* rlist_free(&swapmap, 100, 200); add space to swapmap 

* rlist_alloc(&swapmap, 100, &loc); obtain 100 sectors from swap 
*/ 


/* A resource list element. */ 
struct rlist { 

unsigned rl_start; /* boundaries of extent - inclusive */ 
unsigned rl_end; /* boundaries of extent - inclusive */ 

struct rlist *rl_next; /* next list entry, if present */ 

} ; 

/* Functions to manipulate resource lists. */ 

extern rlist_free __P((struct rlist **, unsigned, unsigned));
int rlist_alloc __P((struct rlist **, unsigned, unsigned *));
extern rlist_destroy __P((struct rlist **));

/* heads of lists */
struct rlist *swapmap;

[LISTING TWO] 

/* Copyright (c) 1992 William Jolitz. All rights reserved. 

* Written by William Jolitz 1/92 

* Redistribution and use in source and binary forms, with or without 

* modification, are permitted provided that the following conditions are met: 

* 1. Redistributions of source code must retain the above copyright notice, 

* this list of conditions and the following disclaimer. 

* 2. Redistributions in binary form must reproduce the above copyright notice, 

* this list of conditions and the following disclaimer in the documentation 

* and/or other materials provided with the distribution. 

* 3. All advertising materials mentioning features or use of this software 

* must display the following acknowledgement: This software is a component 

* of "386BSD" developed by William F. Jolitz, TeleMuse. 

* 4. Neither the name of the developer nor the name "386BSD" may be used to 

* endorse or promote products derived from this software without specific 

* prior written permission. 

* THIS SOFTWARE IS A COMPONENT OF 386BSD DEVELOPED BY WILLIAM F. JOLITZ AND 

* IS INTENDED FOR RESEARCH AND EDUCATIONAL PURPOSES ONLY. THIS SOFTWARE SHOULD 

* NOT BE CONSIDERED TO BE A COMMERCIAL PRODUCT. THE DEVELOPER URGES THAT USERS 

* WHO REQUIRE A COMMERCIAL PRODUCT NOT MAKE USE OF THIS WORK. 

* THIS SOFTWARE IS PROVIDED BY THE DEVELOPER "AS IS" AND ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 

* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 

* EVENT SHALL THE DEVELOPER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 

* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 

* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; 

* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 

* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR 

* OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 

* ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ 

#include "sys/param.h"
#include "sys/cdefs.h"
#include "sys/malloc.h"
#include "rlist.h"

/* Resource lists. */ 

/* Add space to a resource list. Used to either 

* initialize a list or return free space to it. */ 
rlist_free (rip, start, end) 

register struct rlist **rlp; unsigned start, end; { 
struct rlist *head; 
head = *rlp; 
loop: 

/* if nothing here, insert (tail of list) */ 
if (*rlp == 0) { 

*rlp = (struct rlist *)malloc(sizeof(**rlp), M_TEMP, M_NOWAIT); 

(*rlp)->rl_start = start; 

(*rlp)->rl_end = end; 

(*rlp)->rl_next = 0; 
return; 





} 

/* if new region overlaps something currently present, panic */ 
if (start >= (*rlp)->rl_start && start <= (*rlp)->rl_end) 
panic("overlapping rlist_free: freed twice?"); 
if (end >= (*rlp)->rl_start && end <= (*rlp)->rl_end) 
panic("overlapping rlist_free: freed twice?"); 

/* are we adjacent to this element? (in front) */ 
if (end+1 == (*rlp)->rl_start) { 

/* coalesce */ 

(*rlp)->rl_start = start; 
goto scan; 

} 

/* are we before this element? */ 
if (end < (*rlp)->rl_start) { 

register struct rlist *nlp; 

nlp = (struct rlist *)malloc(sizeof(*nlp), M_TEMP, M_NOWAIT); 
nlp->rl_start = start; 
nlp->rl_end = end; 
nlp->rl_next = *rlp; 

*rlp = nlp; 
return; 

} 

/* are we adjacent to this element? (at tail) */ 
if ((*rlp)->rl_end + 1 == start) { 

/* coalesce */ 

(*rlp)->rl_end = end; 
goto scan; 

} 

/* are we after this element */ 
if (start > (*rlp)->rl_end) { 
rlp = &((*rlp)->rl_next); 
goto loop; 

} else 

panic("rlist_free: can't happen"); 

scan: 

/* can we coalesce list now that we've filled a void? */ 

{ 

register struct rlist *lp, *lpn; 
for (lp = head; lp->rl_next ;) { 

lpn = lp->rl_next; 

/* coalesce ? */ 

if (lp->rl_end + 1 == lpn->rl_start) { 
lp->rl_end = lpn->rl_end; 
lp->rl_next = lpn->rl_next; 
free(lpn, M_TEMP); 

} else 

lp = lp->rl_next; 

} 

} 

} 

/* Obtain a region of desired size from a resource list. If nothing available 

* of that size, return 0. Otherwise, return a value of 1 and set resource 

* start location with *loc. (Note: loc can be zero if we don't wish value) */ 
int rlist_alloc (rlp, size, loc) 

struct rlist **rlp; unsigned size, *loc; { 

register struct rlist *lp = *rlp, *olp = 0; 

/* walk list, allocating first thing that's big enough (first fit) */ 





for (; *rlp; rlp = &((*rlp)->rl_next)) 

if(size <= (*rlp)->rl_end - (*rlp)->rl_start + 1) { 

/* hand it to the caller */ 
if (loc) *loc = (*rlp)->rl_start; 

(*rlp)->rl_start += size; 

/* did we eat this element entirely? */ 
if ((*rlp)->rl_start > (*rlp)->rl_end) { 
lp = (*rlp)->rl_next; 
free (*rlp, M_TEMP); 

*rlp = lp; 

} 

return (1); 

} 

/* nothing in list that's big enough */ 
return (0); 

} 

/* Finished with this resource list, reclaim all space and 
* mark it as being empty. */ 
rlist_destroy (rlp) 
struct rlist **rlp; { 

struct rlist *lp, *nlp; 

lp = *rlp; 

*rlp = 0; 

for (; lp; lp = nlp) { 
nlp = lp->rl_next; 
free (lp, M_TEMP); 

} 

} 

[LISTING THREE] 


/* Excerpted with permission from 4.3BSD include file 

* "/usr/include/sys/exec.h" 

* Redistribution and use in source and binary forms are freely permitted 

* provided that the above copyright notice and attribution and date of work 

* and this paragraph are duplicated in all such forms. 

* THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR 

* IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED 

* WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. 


* Header prepended to each a.out file. 
*/ 


struct exec { 
long a_magic; /* magic number */ 
unsigned long a_text; /* size of text segment */ 
unsigned long a_data; /* size of initialized data */ 
unsigned long a_bss; /* size of uninitialized data */ 
unsigned long a_syms; /* size of symbol table */ 
unsigned long a_entry; /* entry point */ 
unsigned long a_trsize; /* size of text relocation */ 
unsigned long a_drsize; /* size of data relocation */ 
}; 

#define OMAGIC 0407 /* old impure format */ 
#define NMAGIC 0410 /* read-only text */ 
#define ZMAGIC 0413 /* demand load format */ 





Porting Unix To The 386: Missing Pieces, Part II 

Finishing the NET/2 release of Berkeley UNIX to obtain a complete, unencumbered system for the 386 
PC. Describes the methodology and implementation of the remaining facilities necessary to generate a 
working operating system for the PC. 

Completing the 386BSD kernel 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, the first virtual-memory, microprocessor-based UNIX system. Prior to 
establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric 
Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or 
comments to ljolitz@cardio.ucsf.edu. (c) 1992 TeleMuse. 

Last month, we began the final steps of a journey that will lead to a bootable running 386BSD kernel that 
provides a self-supporting development environment. The source code presented over the last two 
installments—plus a small set of bug fixes and a recent copy of the NET/2 tape—will enable you to build an 
operational kernel. 

This month, we'll implement a "bare-bones" execve() system call that allows 386BSD to provide basic 
operation, a block-I/O buffer cache used to reduce the cost of UNIX file operations, and ring buffers that 
reduce the cost of tty-character buffer management. 

Leveraging the POSIX Definition 

POSIX 1003.1 describes the semantics of file execution and the requirements of the implementation. It 
defines, from the point of view of a C program’s main() procedure, just what needs to be delivered to the 
new process image. Aside from the arguments passed to the new process image, it also describes the 
structure of the environment that is passed and the treatment of other system facilities (such as files, 
signals, and process credentials). It also defines the possible error conditions that exec() should return if it 
cannot correctly complete the request. All the various function calls of the exec() family are implemented 
in the NET/2 object library, and they all eventually translate down to an execve() system call that actually 
does the work. 

Choices of Implementation 

While POSIX says nothing about system-call semantics (because it’s entirely an object-library based 
standard), both The UNIX System V Programmer’s Reference Manual and the BSD Programmer’s 
Manual fill in the minor details lacking in the standard. In fact, we could even implement the execve() 
function from user code that manipulates the address space with special system calls. During early 
discussions at Berkeley concerning the 4.2BSD design process, "RISCizing" the system calls was strongly 
considered, and the system call that topped the lists for this treatment was exec(). (It was to be built out of 
segment-creation calls, piece by piece.) Recent versions of Mach have implemented a program loader that 
works somewhat like this, in that the understanding of various executable formats can be exported to a 
user program loader. In MINIX, the responsibility is sensibly split between user and kernel: The user 
exec() subroutine gathers a buffer of argument pointers, and the kernel copies the buffer into the new 
process and then fixes up the argument pointers by relocating them (now that the new location is known in 
the new process’s image). Unfortunately, we cannot implement these approaches because the Intel BCS 
(Binary Compatibility Standard) specifies that execve() is a system call, and a system call implements the 
full semantics of execve()’s operation. 

The costs in execve() arise primarily from collecting all the argument (and environment) strings, saving 
them temporarily, and then depositing them in the new process. The representation of argument strings is 
dense; that is, the strings are adjacent at the top of the new stack with their corresponding pointers. 
Therefore, we generally use a two-pass algorithm, in which the first pass determines the size of the space 
needed to hold the strings and the number of strings that exist. Thus, we can reserve space and determine 
the relocation of the strings and arguments in the new process’s image (remember, we’re packing things 
together at the top of the stack, working from the top down). The principal reason for tightly packed 
arguments is so that a program knows that it can have (at most) ARG_MAX bytes of incoming arguments, 
and that those bytes can be saved away (by a bcopy, perhaps) to another region without chasing each of 
them down. 
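
The first pass of such a two-pass scheme can be sketched in ordinary user-level C. This is only an illustration, not the kernel code: the names measure_vector(), string_base(), and SK_ARG_MAX are hypothetical, and the 4096-byte limit is an assumed stand-in for the system's ARG_MAX. 

```c
#include <stddef.h>
#include <string.h>

/* Assumed limit for this sketch; a real system takes ARG_MAX from <limits.h>. */
#define SK_ARG_MAX 4096

/*
 * Pass 1: walk a NULL-terminated vector, counting the strings and the bytes
 * they occupy (terminating NULs included). Returns 0 on success, or -1 if
 * the strings would not fit within the limit.
 */
static int
measure_vector(char *const vec[], size_t *count, size_t *bytes)
{
    size_t n = 0, total = 0;

    for (; vec[n] != NULL; n++) {
        total += strlen(vec[n]) + 1;
        if (total > SK_ARG_MAX)
            return -1;
    }
    *count = n;
    *bytes = total;
    return 0;
}

/*
 * With the totals known, the strings can be placed flush against the top of
 * the stack. This returns the address at which the first string would start;
 * pass 2 (not shown) would copy the strings there and relocate the pointers.
 */
static size_t
string_base(size_t stack_top, size_t bytes)
{
    return stack_top - bytes;
}
```

Because the sizes are known before anything is copied, the second pass never needs to move a string twice; the price is walking every string two times. 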

If we don’t mind wasting some space, we can make the algorithm single pass, by assuming where the 
arguments and strings will go, copying them into position, and doing the relocation all at the same time. 
We can’t destroy the old process’s stack (who knows, an argument might point at an invalid location, or 
worse yet, at the location we are copying to for the new image), so we need to save our results elsewhere. 
By forcing this buffer to be aligned on an appropriate page boundary instead of copying it to the new 
address space, we can map the page accordingly and maintain our single-pass objective. 


execve() 


Our implementation (see Listing One, page 96) contains a bare-bones execve() system call that allows 
386BSD to provide basic operation. This minimal version is compact enough to discuss easily, yet 
complete enough to allow a system using it to rebuild the kernel. In other words, this is not a toy 
implementation! 


As a system call in the kernel, this procedure obeys certain conventions that all system-call code in our 
386BSD kernel uses. The arguments to all system-call code are identical in number and use. The first 
parameter is always a pointer to the process structure of the process requesting the system call, and all 
process-relative facilities, resources, and state information are referenced solely through this pointer. (See 
Figure 2 in our August 1991 386BSD article.) The next argument is a pointer to the user’s arguments for 
this system call. Each system call itself has different numbers and types of arguments, so it is a pointer to a 
structure that is usually different per each system call. The execve system call itself has three arguments: 
the filename of the file to be executed, a pointer to a vector of string arguments for the new process, and a 
pointer to a vector of string environment variables. The last parameter of the execve() procedure is the 
return value pointer, which can be used to return a data value to the caller of the system call. 

Independently, an error can be passed to the system-call management function syscall(), which calls all 
system-call functions in the kernel, through the value that the execve() function itself returns. 
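
The calling convention described above can be mimicked outside the kernel. The following is a minimal sketch under stated assumptions: struct proc is reduced to a single field, and sk_getpid(), sk_execve(), sk_sysent, and sk_syscall() are hypothetical names invented for the illustration, not the kernel's own. 

```c
#include <errno.h>
#include <stddef.h>

struct proc { int p_pid; };   /* stand-in for the kernel's process structure */

/* Every handler has the same shape: process, argument block, return slot.
 * The function result is the error code; 0 means success. */
typedef int (*sk_call_t)(struct proc *p, void *uap, int *retval);

static int
sk_getpid(struct proc *p, void *uap, int *retval)
{
    (void)uap;                /* this call takes no user arguments */
    *retval = p->p_pid;       /* data value handed back to the caller */
    return 0;
}

struct sk_execve_args { char *fname; char **argp; char **envp; };

static int
sk_execve(struct proc *p, void *uap, int *retval)
{
    struct sk_execve_args *a = uap;

    (void)p;
    (void)retval;
    if (a->fname == NULL)
        return EFAULT;
    return ENOEXEC;           /* sketch only: nothing is ever loaded */
}

static sk_call_t sk_sysent[] = { sk_getpid, sk_execve };

/* The dispatcher: look up the handler, call it, pass its error code along. */
static int
sk_syscall(struct proc *p, unsigned code, void *uap, int *retval)
{
    if (code >= sizeof(sk_sysent) / sizeof(sk_sysent[0]))
        return ENOSYS;
    *retval = 0;
    return sk_sysent[code](p, uap, retval);
}
```

Keeping the data result (retval) separate from the error result (the function's return value) lets the dispatcher treat every call uniformly, whatever its arguments look like. 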


This implementation can be broken down into five separate steps: file validation, executable format 
recognition and consistency checking, reading and processing argument strings, building a new process 
image, and preparing the new image for execution. 





File Validation 


We first check whether the file we’ve been asked to execute actually exists, and if so, the internal name by 
which to refer to it (the vnode). We do this by employing the file-lookup function called namei(). This 
function has so many arguments that it actually requires a structure (nameidata) to detail all the related 
matters that will occur as a result of its use. We instruct namei() to LOOKUP the filename and return a 
pointer to an internal file reference (a vnode or abstract file node), to LOCK that vnode so that others don’t 
alter it while we process our execve() system call, and to FOLLOW any symbolic links it encounters as 
it incrementally evaluates the filename. namei() is also told that the filename to be looked up is in the user 
process’s address space, at the address specified by the first argument of the execve() system call. 
When namei() succeeds, we check whether the file is "regular" (that is, not a directory, socket, device, 
fifo, and so on), and whether it’s an executable file about which we can obtain status information. All 
these calls occur relative to our "virtual," or abstract file-system level, and the calls hide access to the 
actual mechanisms that implement the file system’s operations. The main result of this step is a pointer to 
a vnode (ndp->ni_vp) that points to a qualified file. 

Executable Format Recognition and Consistency Checking 

Now that we have a file we can read, we need to divine its format and check this against what we can 
execute. 

vn_rdwr() is a vast, kitchen-sink styled internal kernel procedure that implements the general scheme for 
reading a desired amount of data from a vnode. It places this in our prototype header structure (hdr), so we 
can then dig through it to validate that it’s a recognizable format with a sane request. We don’t want to 
execute if the file is too small or the parameters are not aligned with page boundaries. We check sizes, first 
separately and then together, to avoid the chance that together they overflow our 4-gigabyte address space. 
The result is that we now have a vnode that contains an image we can load and execute. 
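
The "separately, then together" ordering of the size checks can be shown with a small stand-alone function. This is a sketch, not the code of Listing One: the SK_ limits and the page size are assumed values chosen for the example, and struct sk_exec carries only the header fields the checks need. 

```c
#include <errno.h>

#define SK_ZMAGIC   0413L                 /* demand-load magic number */
#define SK_NBPG     4096UL                /* assumed page size */
#define SK_MAXTSIZ  (6UL * 1024 * 1024)   /* assumed text limit */
#define SK_MAXDSIZ  (32UL * 1024 * 1024)  /* assumed data limit */

struct sk_exec {
    long          a_magic;
    unsigned long a_text, a_data, a_bss;
};

/* Returns 0 if the header describes an image we could load, else ENOEXEC. */
static int
sk_check_header(const struct sk_exec *h, unsigned long file_size)
{
    if (h->a_magic != SK_ZMAGIC)
        return ENOEXEC;

    /* Each size on its own first, so that the sums below cannot wrap. */
    if (h->a_text > SK_MAXTSIZ || h->a_text % SK_NBPG || h->a_text > file_size)
        return ENOEXEC;
    if (h->a_data == 0 || h->a_data > SK_MAXDSIZ || h->a_data > file_size)
        return ENOEXEC;
    if (h->a_bss > SK_MAXDSIZ)
        return ENOEXEC;

    /* Then together: the file must hold text plus data, and the whole
     * image must fit within the combined limits. */
    if (h->a_text + h->a_data > file_size)
        return ENOEXEC;
    if (h->a_text + h->a_data + h->a_bss > SK_MAXTSIZ + SK_MAXDSIZ)
        return ENOEXEC;
    return 0;
}
```

Once every field is known to be below its own limit, the additions are bounded by the sum of those limits and cannot overflow an unsigned long. 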

Reading and Processing Argument Strings 

We need to collect argument and environment strings prior to loading the new image. In this 
implementation, we take advantage of the "floating" location of the user stack (no absolute reference has 
been assumed in 386BSD) and our 4-Gbyte process address space. We create a "new" stack within the 
"old" image in its new process-image location, and then build it in place. Thus, we consume 32 Mbytes of 
virtual address space, but only allocate three to ten pages of actual memory touched ("sparse" allocation). 
Vector by vector, we obtain the address of argument strings and use copyinoutstr() to obtain a string from 
the user’s address space that does not exceed the size of our temporary buffer. (Unlike applications 
programming, kernel work is loaded with obscurely arranged procedures that service special-purpose 
needs ideally.) We build a pointer to the string that corresponds to its location in the new process’s image. 
After doing this for all arguments, we repeat the procedure on the environment vector, thus reusing the 
code for a similar purpose. Due to sparse allocation, three objects (argument vectors, strings, and stack) 
each take up at least a page. (These could be condensed to a single page if speed considerations are 
subordinate to space considerations.) 
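
A single-pass build of this kind can be imitated in user-level C. The sketch below is an assumption-laden illustration rather than the kernel routine: struct sk_frame, sk_pack(), and the fixed capacities are hypothetical, plain memcpy() stands in for the copy from user space, and new_base is simply the address the string area is presumed to occupy in the new image. 

```c
#include <stddef.h>
#include <string.h>

#define SK_MAXSTR 256   /* assumed capacity of the string area */
#define SK_MAXVEC 16    /* assumed capacity of the pointer area */

struct sk_frame {
    unsigned long vec[SK_MAXVEC];     /* relocated string addresses */
    char          strings[SK_MAXSTR]; /* the strings, packed end to end */
    size_t        nvec;               /* pointer slots used so far */
    size_t        used;               /* string bytes used so far */
};

/*
 * Copy one NULL-terminated vector into the frame in a single pass. Each
 * string is copied once, and its pointer is recorded as the address the
 * string will have in the new image (new_base plus its offset). A zero
 * entry terminates the vector. Returns the number of strings copied, or
 * -1 if either area runs out of room.
 */
static int
sk_pack(struct sk_frame *f, char *const src[], unsigned long new_base)
{
    int n = 0;

    for (; src[n] != NULL; n++) {
        size_t len = strlen(src[n]) + 1;

        if (f->used + len > SK_MAXSTR || f->nvec + 1 >= SK_MAXVEC)
            return -1;
        memcpy(f->strings + f->used, src[n], len);
        f->vec[f->nvec++] = new_base + f->used;
        f->used += len;
    }
    if (f->nvec >= SK_MAXVEC)
        return -1;
    f->vec[f->nvec++] = 0;
    return n;
}
```

Calling sk_pack() once for the argument vector and once more for the environment vector reuses the same code for both, much as the text describes. 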





Building a New Process Image 


We can do no more at this point with the old image; we must destroy the old to build the new. We can no 
longer return to the image we were called from and are now committed. We must ask the virtual memory 
system to erase the old process image and map the new executable file in its place, taking care to leave the 
new stack present. Note that no I/O is yet done, and that the pages will be demand-loaded on access. If, 
following the vm_mmap(), we referenced the bottom of the virtual address space where this file is 
mapped, a fault would be generated. The page in the file associated with the address would be read in and 
our reference would be satisfied. MAP_COPY and VM_PROT_ALL specify that the pages may be 
referenced for all purposes (read, write, and execute), and that any changes will affect this process’s copy 
only. We want instructions to be protected from modification by the program, so vm_protect() allows us to 
restrict valid references to read and execute (not write) over the extent of the text (instruction) portion of 
the address space. Next, we allocate any remaining uninitialized data address space with anonymous 
paged memory from the virtual memory system. The virtual memory system has simply been instructed in 
building data structures that it can consult to decide how to handle faults in specific portions of the 
processor's address space. No pages of memory have been allocated, nor have any parts of the processor’s 
address-mapping hardware been touched in building the new process image, other than to invalidate the 
address range. 

Preparing the New Image for Execution 

Before we return from execve(), we must inform the system of the new image’s characteristics, close off 
any files as needed, and reset caught signals. We set the stack pointer and other registers (setregs for the 
PC), unlock and release the file so others can mess with it, and return to the new image we have just 
effectively loaded. We have not done a shred of I/O to read in the instruction pages, so the first return to 
start the user process is guaranteed to generate a page fault. The virtual memory system will then consult 
the information from the previous step to satisfy a "page in" request from our executable file. This is how 
the cycle of life begins for our new incarnation of the process. 

What’s Not Finished with execve()? 

To be POSIX compliant, we should implement the famous setuid/setgid features of UNIX, by extracting 
information out of the file-attribute buffer and altering the process’s credentials (for instance, user id and 
group id). We’ve neglected any details concerning statistics updating/gathering and accounting. Another 
intentional oversight was neglecting the interface to the user-process debugging mechanism. Each of the 
items results in small additions to this implementation. These are good exercises for the enthusiast but are 
outside the scope of this article. 

Importantly, the 386BSD system will now operate, and allow us to recompile the kernel, even without 
these additional changes, but the fully fleshed-out execve() does allow facilities like su(1) to 
work. 

Another useful functionality not discussed is the ability to execute more than one kind of file format. We 
are unsure how far to extend 386BSD in this direction, as there are literally dozens of potential formats. 
One possibility is to create a new format that just addresses the weaknesses we have seen. This may be 
necessary for the multiprocessor, multithreaded version of 386BSD. 





Block I/O Cache 


Much of the file-system code and various other facilities of our BSD kernel use the ancient UNIX 
buffer-cache interface. This buffer cache, splendidly and generously described in Maurice Bach’s The 
Design of the UNIX Operating System, implements a file-system, block-oriented cache of I/O operations. 
Since the original UNIX file system, block buffers have been used to reduce the cost of UNIX file 
operations, in which partial block reads and writes occur frequently. By retaining a small cache of those 
frequently accessed buffers, the UNIX kernel could avoid unnecessary redundant I/O operations. The 
mechanisms of delayed writes avoid writing a block until the buffer is reused; read-aheads ensure that the 
successive block will be available by "prereading" it. 

The principal interface to the rest of the 386BSD kernel is through the procedures getblk(), bread(), 
breada(), bwrite(), bdwrite(), bawrite(), and brelse(). 

getblk() sifts through the buffer cache, looking for a matching buffer that it can make busy. Failing that, it 
allocates a new buffer out of those currently not busy, makes it busy, and returns it to its caller. The caller 
of getblk() can tell if the contents were obtained from the cache or if the block needs to be filled because 
it’s not cached or valid. If getblk() ever needs to pause to wait for something to become available, it needs 
to restart its algorithm on the off chance that it is working with stale pointers. If getblk() attempts to 
contend for a free buffer, it needs to ensure unique access by issuing a splbio(), thus blocking out all 
asynchronous events that could intrude. 

bread() uses getblk() to obtain the appropriate buffer. If the 
contents of the buffer are not appropriate, it issues a logical read of the buffer, VOP_STRATEGY(), 
whereby the logical-to-physical mapping occurs and the I/O operation is passed to the driver. A wait for 
the operation to complete (biowait()) is then entered. With appropriate contents, the buffer is returned for 
unique access by the caller. 
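
The interplay of getblk(), bread(), and brelse() can be modeled with a toy cache in user-level C. Everything here is a simplifying assumption made for the sketch: the sk_ names, the four-buffer cache, the in-memory sk_disk array standing in for the driver, and the decision to return NULL where the kernel would sleep and restart. 

```c
#include <stddef.h>
#include <string.h>

#define SK_NBUF  4      /* buffers in the toy cache */
#define SK_BSIZE 16     /* bytes per block */
#define SK_NBLK  8      /* blocks on the pretend device */

#define SK_BUSY  0x1    /* buffer is held by a caller */
#define SK_VALID 0x2    /* buffer contents match the device */

struct sk_buf {
    int           blkno;
    int           flags;
    unsigned long lru;              /* stamp of the most recent release */
    char          data[SK_BSIZE];
};

static struct sk_buf sk_cache[SK_NBUF];
static char          sk_disk[SK_NBLK][SK_BSIZE];
static unsigned long sk_clock;
static int           sk_reads;      /* device reads actually issued */

static void
sk_init(void)
{
    int i;

    memset(sk_cache, 0, sizeof sk_cache);
    for (i = 0; i < SK_NBUF; i++)
        sk_cache[i].blkno = -1;
    sk_clock = 0;
    sk_reads = 0;
}

/* Find the block in the cache, or recycle the least recently released
 * buffer that is not busy. NULL means "would have to wait". */
static struct sk_buf *
sk_getblk(int blkno)
{
    struct sk_buf *bp, *victim = NULL;

    for (bp = sk_cache; bp < sk_cache + SK_NBUF; bp++) {
        if (bp->blkno == blkno) {
            if (bp->flags & SK_BUSY)
                return NULL;
            bp->flags |= SK_BUSY;
            return bp;
        }
        if (bp->flags & SK_BUSY)
            continue;
        if (victim == NULL || bp->lru < victim->lru)
            victim = bp;
    }
    if (victim == NULL)
        return NULL;
    victim->blkno = blkno;
    victim->flags = SK_BUSY;        /* contents are not valid yet */
    return victim;
}

/* Get the buffer and fill it from the device only if it is not valid. */
static struct sk_buf *
sk_bread(int blkno)
{
    struct sk_buf *bp;

    if (blkno < 0 || blkno >= SK_NBLK)
        return NULL;
    bp = sk_getblk(blkno);
    if (bp != NULL && !(bp->flags & SK_VALID)) {
        memcpy(bp->data, sk_disk[blkno], SK_BSIZE);  /* "strategy + biowait" */
        bp->flags |= SK_VALID;
        sk_reads++;
    }
    return bp;
}

/* Return the buffer to the pool, most recently used last. */
static void
sk_brelse(struct sk_buf *bp)
{
    bp->flags &= ~SK_BUSY;
    bp->lru = ++sk_clock;
}
```

Reading the same block twice issues one device read; reading more distinct blocks than there are buffers pushes the oldest one out, so a later read of it goes to the device again. 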

breada() is like bread(), but it overlaps the possible first read operation with the second read operation, in 
an attempt to force the read-ahead block into the cache, in anticipation of it being read by the process 
"soon." If the read-ahead block is already in the cache, then the block is merely moved to the tail of the 
LRU chain so it won’t be reused so soon. breada() is naive about cache flooding and relies on the wait for 
subsequent blocks being high because the blocks are not contiguous. Thus, its concept of "double 
buffering" works best in a file system with rotational delays (unlike, for example, a log-based file system). 

bwrite() accomplishes a synchronous write of a block obtained from one of the previously mentioned 
sources (or indirectly, such as through a delayed write). The only magic here, other than being almost 
symmetrical with bread(), is that delayed writes need to inform the vnode layer that they are no longer 
delayed. After the write completes, the block buffer is returned to the freelist, ready for others to use. 

bdwrite() does not actually do a write; it just marks the block and tells the vnode layer of its special 
significance. Tape drives can never have delayed (and possibly unordered) writes, so for them we enforce a 
synchronous write. 

bawrite() is much like the synchronous case; however, it marks the block to be released when output is 
finished using the ASYNC flag on the block. Note that the read-ahead block will also be released in the 
same manner when its read completes. 





brelse() is how a block buffer is returned for use by the rest of the system. To prevent congestion, other 
processes waiting for this or any other block are notified of the change in state. The WANTED flags 
merely reduce the number of spurious wakeups that might otherwise be generated. We then categorize the 
block and put it on the appropriate list for reuse. (This is where buffer-cache policies are instituted.) It then 
is marked no longer busy, and may be reallocated. 

getblk() and others use incore() and getnewbuf() to do the dirty work of locating blocks in the cache and 
obtaining them. Buffers are allocated space with malloc(), and if the size changes, allocbuf() adjusts the 
size accordingly. The file systems themselves are responsible for upgrading the size of blocks that are 
cached, because new data might need to be read in. Note that as buffers gain and leave association with a 
given file (vnode pointer), they must inform the virtual file-system layer of the event. Likewise, a block 
buffer must gain and lose association with a given freelist (search for a freeblock), and hashlist (so it can 
be located by search for contents). 


biowait() is used to wait for a buffer to be finished with I/O, and biodone() signals the end of I/O to 
interested biowait() calls; see Listing Two, page 101. Both are used for dealing with the drivers. 
biodone(), in particular, needs to specially handle cases for the virtual memory system (B_CALL) and 
asynchronous I/O (deallocation). 


4.4BSD Demands 


The current BSD kernel uses the block cache quite differently from older versions. It is now a logical 
cache: Its contents are relative to a logical file rather than to a physical disk-sector address. As a result, the 
virtual file-system layer must translate between the two on demand, and do the necessary I/O operation, 
VOP_STRATEGY(). 

Also, the vnode layer must track delayed writes with a list of dirty blocks per each vnode. bgetvp(), 
brelvp(), and reassignbuf() track the assignment of blocks to clean and dirty block lists in each vnode. This 
information is consulted with a file commit (fsync()). File-system commit (sync()) operation is done by 
the file-system layers. 

Because the file system may work with variable block sizes, the buffer contents and sizing are actually the 
province of the file-system code. Surprisingly, the buffer cache has no knowledge of the scaling of the 
logical blocks it manages, and limited knowledge of the size of the cached block itself. (Only the file 
system associated with the block knows how much is valid at any time.) This makes it particularly 
difficult to advance the state-of-the-art of a file-system page cache. 

4.4BSD Weaknesses 


As a result, there are many weaknesses in this design. First, it creates a synchronous logjam on I/O, as 
blocks are sequenced out to the disk driver. (Everything is done in small, synchronous transfers that don't 
allow the modern disk subsystems to maintain high data rates.) The buffer cache is privately managed 
space, so it competes for resources with the virtual memory system, with which it’s not on speaking terms. 
Worse yet, both tend to cache the same data, so they are at odds with duplicated effort and state 
information. Finally, the cache policies implied by the buffer cache usurp the kinds of policies a file 
system might wish to make. An example of this is when an NFS file server wishes to offer "leases" on 
buffered contents of file-system information to reduce client-server cache coherence cost. 






Terminal Ring Buffers 


The final missing piece for an operable system is the code to manipulate the character buffers used by the 
tty driver, called "clists." clists are just another privately managed buffer mechanism that relies on the 
reallocation of a pool of small (32-byte) blocks of character data that can be viewed logically as a FIFO 
queue of characters. The tty driver uses these queues to implement a general-purpose terminal interface for 
consoles, serial ports, and network sessions. clists are an elegant mechanism to allow numerous terminal 
sessions to share a region of buffer memory, and they were ideal for a timesharing system with small 
memory and a large number of competing sessions. However, they are cumbersome to maintain and 
inconvenient for mass transfers (such as painting a bitmap screen). Therefore, we have written code to 
implement ring buffers in their place. These reduce the cost of character buffer management, especially in 
the mass transfer case. Instead of making an analogue of the interface of clists, we modified the BSD tty 
driver and related code to take advantage of large, contiguous buffer regions of characters that this 
approach afforded. 


Character-by-Character Operations 


For the drivers themselves, written with character-by-character code, we kept the traditional getc/putc 
operations and their inverse operations ungetc/unputc; see Listing Three, page 103. These work by means 
of successor and predecessor macros (Listing Four, page 104) that topologically make a ring buffer’s 
data region contiguous. A side effect of this is that the operations and inverses are valid for any underlying 
method of storage. So if, for example, we wanted to use another buffering mechanism (such as BSD 
mbufs), we could do so just by modifying the macros. 
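
How successor and predecessor macros can carry all four operations is easy to see in a small stand-alone version. This sketch is not the code of the listings referred to above: the sk_ names, the eight-byte ring, and the convention of leaving one slot empty to tell "full" from "empty" are assumptions made for the example. 

```c
#define SK_RBSZ 8   /* assumed ring size; one slot is always left empty */

struct sk_ringb {
    char *rb_hd;            /* next character to take */
    char *rb_tl;            /* next free slot */
    char  rb_buf[SK_RBSZ];
};

/* Successor and predecessor of a pointer into the ring, wrapping at the ends. */
#define SK_SUCC(p, rb) \
    ((p) >= (rb)->rb_buf + SK_RBSZ - 1 ? (rb)->rb_buf : (p) + 1)
#define SK_PRED(p, rb) \
    ((p) <= (rb)->rb_buf ? (rb)->rb_buf + SK_RBSZ - 1 : (p) - 1)

static void
sk_rbinit(struct sk_ringb *rb)
{
    rb->rb_hd = rb->rb_tl = rb->rb_buf;
}

/* Append a character at the tail; -1 if the ring is full. */
static int
sk_putc(int c, struct sk_ringb *rb)
{
    char *next = SK_SUCC(rb->rb_tl, rb);

    if (next == rb->rb_hd)
        return -1;
    *rb->rb_tl = (char)c;
    rb->rb_tl = next;
    return 0;
}

/* Remove the oldest character from the head; -1 if the ring is empty. */
static int
sk_getc(struct sk_ringb *rb)
{
    int c;

    if (rb->rb_hd == rb->rb_tl)
        return -1;
    c = (unsigned char)*rb->rb_hd;
    rb->rb_hd = SK_SUCC(rb->rb_hd, rb);
    return c;
}

/* Inverse of getc: push a character back in front of the head. */
static int
sk_ungetc(int c, struct sk_ringb *rb)
{
    char *prev = SK_PRED(rb->rb_hd, rb);

    if (prev == rb->rb_tl)
        return -1;
    *prev = (char)c;
    rb->rb_hd = prev;
    return 0;
}

/* Inverse of putc: take back the most recently appended character. */
static int
sk_unputc(struct sk_ringb *rb)
{
    if (rb->rb_hd == rb->rb_tl)
        return -1;
    rb->rb_tl = SK_PRED(rb->rb_tl, rb);
    return (unsigned char)*rb->rb_tl;
}
```

None of the four functions computes an index or tests for the end of the array; only the two macros know the storage is a ring, which is why swapping in a different storage scheme would mean changing the macros alone. 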


Block Operations 

Block operations are afforded by the contiguous transfer-length macros that allow inline code to 
manipulate ring-buffer contents in contiguous sections. This means that code to replace clist-to-block (and 
its inverse) must be generated on a case-by-case basis, but that code is exactly where the transfer rate 
bottleneck is anyway, so this is appropriate. 
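
A contiguous transfer-length macro of the kind described might look like the following. Again this is a sketch under assumptions, not the article's listing: SK_CONTIGGET() and sk_rb_read() are hypothetical names, and the ring layout is the same assumed eight-byte structure with head and tail pointers. 

```c
#include <stddef.h>
#include <string.h>

#define SK_RBSZ 8   /* assumed ring size */

struct sk_ringb {
    char *rb_hd;            /* next character to take */
    char *rb_tl;            /* next free slot */
    char  rb_buf[SK_RBSZ];
};

/* Number of bytes that can be read starting at rb_hd without wrapping. */
#define SK_CONTIGGET(rb) \
    ((rb)->rb_hd <= (rb)->rb_tl \
        ? (size_t)((rb)->rb_tl - (rb)->rb_hd) \
        : (size_t)((rb)->rb_buf + SK_RBSZ - (rb)->rb_hd))

/*
 * Drain up to len bytes into dst. Each trip round the loop moves one
 * contiguous section, so a wrapped ring is emptied with at most two
 * memcpy() calls instead of one getc per character.
 */
static size_t
sk_rb_read(struct sk_ringb *rb, char *dst, size_t len)
{
    size_t done = 0;

    while (done < len) {
        size_t n = SK_CONTIGGET(rb);

        if (n == 0)
            break;                          /* ring is empty */
        if (n > len - done)
            n = len - done;
        memcpy(dst + done, rb->rb_hd, n);
        done += n;
        rb->rb_hd += n;
        if (rb->rb_hd == rb->rb_buf + SK_RBSZ)
            rb->rb_hd = rb->rb_buf;         /* wrap to the start */
    }
    return done;
}
```

The per-character cost disappears from the inner loop, which is the point of handling the mass-transfer case separately from the character-by-character one. 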

This scheme requires more space for each active terminal session, because the blocks don't share buffer 
space but still have private ring-buffer contents. Also, at the moment, this code is faster than the buffering 
policies anticipate, so the higher layers of the tty driver suspend operation for too long, anticipating the 
usual transfer rate. As such, much work needs to be done to tune the system for this mechanism. 


386BSD: Other Portions Beyond Basic Operation 

With the code presented in this article and a list of trivial bug fixes available from DDJ, the system 
becomes bootable (using the MS-DOS bootstrap) and can rebuild itself. However, to be fully complete, 
two areas remain. One is the "raw" I/O facility for mass-storage devices, that allows block transfer directly 
to a user process. This is used in file-system integrity checks, and file-system dump and backup 
procedures. In addition, no user-process debugging can be done, because the process-tracing facility is not 
yet present (although process core dumps are available). 






Lessons Learned 


We were loath to proceed in the manner outlined here because we ended up creating some 
backward-looking portions of code. We worried about the waste of time and loss of focus such a diversion 
might cause. However, in retrospect, it was the fastest way to clearly outline the problems to be considered 
while working on more grandiose or innovative schemes. Our enforced realism exposed many weaknesses 
lying dormant and taken for granted. 

William Saroyan once said that re-reading a good book was never a waste of time, because in every great 
work lay little things that had been unnoticed or forgotten. It seems that the same holds true for systems 
programming and design. 

[LISTING ONE] 

/* Copyright (c) 1989, 1990, 1991, 1992 William F. Jolitz, TeleMuse 

* All rights reserved. 

* Redistribution and use in source and binary forms, with or without 

* modification, are permitted provided that the following conditions 

* are met: 

* 1. Redistributions of source code must retain the above copyright 

* notice, this list of conditions and the following disclaimer. 

* 2. Redistributions in binary form must reproduce the above copyright 

* notice, this list of conditions and the following disclaimer in the 

* documentation and/or other materials provided with the distribution. 

* 3. All advertising materials mentioning features or use of this software 

* must display the following acknowledgement: 

* This software is a component of "386BSD" developed by 

* William F. Jolitz, TeleMuse. 

* 4. Neither the name of the developer nor the name "386BSD" may be used to 

* endorse or promote products derived from this software without specific 

* prior written permission. 

* THIS SOFTWARE IS A COMPONENT OF 386BSD DEVELOPED BY WILLIAM F. JOLITZ 

* AND IS INTENDED FOR RESEARCH AND EDUCATIONAL PURPOSES ONLY. THIS 

* SOFTWARE SHOULD NOT BE CONSIDERED TO BE A COMMERCIAL PRODUCT. 

* THE DEVELOPER URGES THAT USERS WHO REQUIRE A COMMERCIAL PRODUCT 

* NOT MAKE USE OF THIS WORK. THIS SOFTWARE IS PROVIDED BY THE DEVELOPER 

* "AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 

* TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 

* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE DEVELOPER BE LIABLE FOR ANY 

* DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 

* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 

* LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 

* ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 

* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF 

* THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 

* 

* This procedure implements a minimal program execution facility for 

* 386BSD. It interfaces to the BSD kernel as the execve system call. 

* Significant limitations and lack of compatibility with POSIX are 

* present with this version, to make its basic operation more clear. 

*/ 


#include "param.h" 
#include "systm.h" 


#include "proc.h" 
#include "mount.h" 
#include "namei.h" 
#include "vnode.h" 
#include "file.h" 
#include "exec.h" 
#include "stat.h" 
#include "wait.h" 
#include "signalvar.h" 
#include "mman.h" 
#include "malloc.h" 

#include "vm/vm.h" 
#include "vm/vm_param.h" 
#include "vm/vm_map.h" 
#include "vm/vm_kern.h" 

#include "machine/reg.h" 
extern int dostacklimits; 


/* execve() system call. */ 

/* ARGSUSED */ 
execve(p, uap, retval) 
struct proc *p; 
register struct args { 
char *fname; 

char **argp; 

char **envp; 

} *uap; 
int *retval; 

{ 

register struct nameidata *ndp; 
struct nameidata nd; 
struct exec hdr; 

char **argbuf, **argbufp, *stringbuf, *stringbufp; 
char **vectp, *ep; 

int needsenv, limitonargs, stringlen, addr, size, len, 
rv, amt, argc, tsize, dsize, bsize, cnt, foff; 
struct vattr attr; 
struct vmspace *vs; 
caddr_t newframe; 

/* Step 1. Lookup filename to see if we have something to execute. */ 
ndp = &nd; 

ndp->ni_nameiop = LOOKUP | LOCKLEAF | FOLLOW; 
ndp->ni_segflg = UIO_USERSPACE; 
ndp->ni_dirp = uap->fname; 

/* is it there? */ 
if (rv = namei(ndp, p)) 
return (rv); 

/* is it a regular file? */ 
if (ndp->ni_vp->v_type != VREG) { 
vput(ndp->ni_vp); 
return(ENOEXEC); 

} 

/* is it executable? */ 

rv = VOP_ACCESS(ndp->ni_vp, VEXEC, p->p_ucred, p); 
if (rv) 





goto exec_fail; 

/* does it have any attributes? */ 

rv = VOP_GETATTR(ndp->ni_vp, &attr, p->p_ucred, p); 
if (rv) 

goto exec_fail; 

/* Step 2. Does file contain a format we can understand and execute */ 
rv = vn_rdwr(UIO_READ, ndp->ni_vp, (caddr_t)&hdr, sizeof(hdr), 

0, UIO_SYSSPACE, IO_NODELOCKED, p->p_ucred, &amt, p); 

/* big enough to hold a header? */ 
if (rv) 

goto exec_fail; 

/* ... that we recognize? */ 
rv = ENOEXEC; 

if (hdr.a_magic != ZMAGIC) 
goto exec_fail; 

/* sanity check "ain't not such thing as a sanity clause" -groucho */ 
if (hdr.a_text > MAXTSIZ
    || hdr.a_text % NBPG || hdr.a_text > attr.va_size)
goto exec_fail;

if (hdr.a_data == 0 || hdr.a_data > DFLDSIZ
    || hdr.a_data > attr.va_size
    || hdr.a_data + hdr.a_text > attr.va_size)
goto exec_fail;
if (hdr.a_bss > MAXDSIZ) 
goto exec_fail; 

if (hdr.a_text + hdr.a_data + hdr.a_bss > MAXTSIZ + MAXDSIZ) 
goto exec_fail; 

/* Step 3. File and header are valid. Now, dig out the strings 

* out of the old process image. */ 

/* We implement a single-pass algorithm that builds a new stack 

* frame within the address space of the "old" process image, 

* avoiding the second pass entirely. Thus, the new frame is 

* in position to be run. This consumes much virtual address space, 

* and two pages more of 'real' memory, such are the costs. 

* [Also, note the cache wipe that's avoided!] */ 

/* create anonymous memory region for new stack */ 
vs = p->p_vmspace; 

if ((unsigned)vs->vm_maxsaddr + MAXSSIZ < USRSTACK) 
newframe = (caddr_t) USRSTACK - MAXSSIZ; 

else 

vs->vm_maxsaddr = newframe = (caddr_t) USRSTACK - 2*MAXSSIZ; 

/* don't do stack limit checking on traps temporarily */ 
dostacklimits = 0; 

rv = vm_allocate(&vs->vm_map, &newframe, MAXSSIZ, FALSE); 
if (rv) goto exec_fail; 

/* allocate string buffer and arg buffer */ 
argbuf = (char **) (newframe + MAXSSIZ - 3*ARG_MAX); 
stringbuf = stringbufp = ((char *)argbuf) + 2*ARG_MAX; 
argbufp = argbuf; 

/* first, do args */ 
vectp = uap->argp; 
needsenv = 1; 
limitonargs = ARG_MAX; 
cnt = 0; 
do_env_as_well: 

if(vectp == 0) goto dont_bother; 

/* for each envp, copy in string */ 





do { 


/* did we outgrow initial argbuf, if so, die */ 
if (argbufp == (char **)stringbuf) { 

rv = E2BIG; 
goto exec_dealloc; 

} 

/* get a string pointer */ 
ep = (char *)fuword(vectp++); 
if (ep == (char *)-1) { 

rv = EFAULT; 
goto exec_dealloc; 

} 

/* if not a null pointer, copy string */ 
if (ep) { 

if (rv = copyinoutstr(ep, stringbufp, 

(u_int)limitonargs, (u_int *) &stringlen)) { 

if (rv == ENAMETOOLONG) 
rv = E2BIG; 
goto exec_dealloc; 

} 

suword(argbufp++, (int)stringbufp); 
cnt++; 

stringbufp += stringlen; 
limitonargs -= stringlen; 

} else { 

suword(argbufp++, 0); 
break; 

} 

} while (limitonargs > 0); 
dont_bother: 

if (limitonargs <= 0) { 

rv = E2BIG; 
goto exec_dealloc; 

} 

/* have we done the environment yet ? */ 
if (needsenv) { 

/* remember the arg count for later */ 

argc = cnt; 

vectp = uap->envp; 

needsenv = 0; 

goto do_env_as_well; 

} 

/* At this point, one could optionally implement a second pass to 

* condense strings, argument vectors, and stack to fit fewest pages. 

* One might selectively do this when copying was cheaper 

* than leaving allocated two more pages per process. */ 

/* stuff arg count on top of "new" stack */ 

argbuf[-1] = (char *)argc; 

/* Step 4. Build the new process image. At this point, we are 

* committed — destroy old executable! */ 

/* blow away all address space, except the stack */ 

rv = vm_deallocate(&vs->vm_map, 0, USRSTACK - 2*MAXSSIZ, FALSE); 

if (rv) 

goto exec_abort; 

/* destroy "old" stack */ 

if ((unsigned)newframe < USRSTACK - MAXSSIZ) { 

rv = vm_deallocate(&vs->vm_map, USRSTACK - MAXSSIZ, MAXSSIZ, 





FALSE); 
if (rv) 

goto exec_abort; 

} else { 

rv = vm_deallocate(&vs->vm_map, USRSTACK - 2*MAXSSIZ, MAXSSIZ, 
FALSE); 
if (rv) 

goto exec_abort; 

} 

/* build a new address space */ 
addr = 0; 

/* screwball mode — special case of 413 to save space for floppy */ 
if (hdr.a_text == 0) { 

foff = tsize = 0; 
hdr.a_data += hdr.a_text; 

} else { 

tsize = roundup(hdr.a_text, NBPG); 
foff = NBPG; 

} 

/* treat text and data in terms of integral page size */ 

dsize = roundup(hdr.a_data, NBPG); 

bsize = roundup(hdr.a_bss + dsize, NBPG); 

bsize -= dsize; 

/* map text & data in file, as being "paged in" on demand */ 
rv = vm_mmap(&vs->vm_map, &addr, tsize+dsize, VM_PROT_ALL, 

MAP_FILE|MAP_COPY|MAP_FIXED, (caddr_t)ndp->ni_vp, foff); 
if (rv) 

goto exec_abort; 

/* mark pages r/w data, r/o text */ 
if (tsize) { 
addr = 0; 

rv = vm_protect(&vs->vm_map, addr, tsize, FALSE, 

VM_PROT_READ|VM_PROT_EXECUTE); 
if (rv) 

goto exec_abort; 

} 

/* create anonymous memory region for bss */ 
addr = dsize + tsize; 

rv = vm_allocate(&vs->vm_map, &addr, bsize, FALSE); 
if (rv) 

goto exec_abort; 

/* Step 5. Prepare process for execution. */ 

/* touchup process information — vm system is unfinished! */ 
vs->vm_tsize = tsize/NBPG; /* text size (pages) XXX */ 

vs->vm_dsize = (dsize+bsize)/NBPG; /* data size (pages) XXX */ 
vs->vm_taddr =0; /* user virtual address of text XXX */ 

vs->vm_daddr = (caddr_t)tsize; /* user virtual address of data XXX */ 
vs->vm_maxsaddr = newframe; /* user VA at max stack growth XXX */ 
vs->vm_ssize = ((unsigned)vs->vm_maxsaddr + MAXSSIZ 

- (unsigned)argbuf)/ NBPG + 1; /* stack size (pages) */ 
dostacklimits = 1; /* allow stack limits to be enforced XXX */ 

/* close files on exec, fixup signals */ 
fdcloseexec(p); 
execsigs(p); 

/* setup initial register state */ 
p->p_regs[SP] = (unsigned) (argbuf - 1); 
setregs(p, hdr.a_entry); 





vput(ndp->ni_vp) ; 
return (0) ; 
exec_dealloc: 

/* remove interim "new" stack frame we were building */ 
vm_deallocate(&vs->vm_map, newframe, MAXSSIZ, FALSE); 
exec_fail: 

dostacklimits = 1; 
vput(ndp->ni_vp); 
return(rv); 
exec_abort: 

/* sorry, no more process anymore, exit gracefully */ 
vm_deallocate(&vs->vm_map, newframe, MAXSSIZ, FALSE); 
vput(ndp->ni_vp) ; 
exit(p, W_EXITCODE(0, SIGABRT) ) ; 

/* NOTREACHED */ 
return(0); 

} 
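
The header checks of Step 2 and the page rounding of Step 4 can be tried outside the kernel. What follows is only a minimal user-space sketch, not the 386BSD code: every `sk_` name is hypothetical, and the limit values standing in for MAXTSIZ, DFLDSIZ, and MAXDSIZ are assumptions chosen for illustration. Only the 4096-byte page size and the 0413 magic number are taken as given.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for kernel constants (limit values are assumed). */
#define SK_NBPG    4096u                  /* 386 page size */
#define SK_ZMAGIC  0413u                  /* demand-paged a.out magic */
#define SK_MAXTSIZ (6u * 1024u * 1024u)   /* assumed text limit */
#define SK_DFLDSIZ (6u * 1024u * 1024u)   /* assumed default data limit */
#define SK_MAXDSIZ (32u * 1024u * 1024u)  /* assumed maximum data limit */

/* Only the header fields the checks look at. */
struct sk_exec { uint32_t a_magic, a_text, a_data, a_bss; };

/* Segment sizes after rounding to whole pages. */
struct sk_sizes { uint32_t tsize, dsize, bsize; };

/* Round x up to the next multiple of y. */
static uint32_t sk_roundup(uint32_t x, uint32_t y)
{
    return ((x + y - 1) / y) * y;
}

/* Mirror of the listing's sanity checks: 1 if acceptable, 0 if not. */
static int sk_header_ok(const struct sk_exec *h, uint32_t file_size)
{
    if (h->a_magic != SK_ZMAGIC)
        return 0;
    if (h->a_text > SK_MAXTSIZ || h->a_text % SK_NBPG
        || h->a_text > file_size)
        return 0;
    if (h->a_data == 0 || h->a_data > SK_DFLDSIZ
        || h->a_data > file_size
        || h->a_data + h->a_text > file_size)
        return 0;
    if (h->a_bss > SK_MAXDSIZ)
        return 0;
    if ((uint64_t)h->a_text + h->a_data + h->a_bss
        > (uint64_t)SK_MAXTSIZ + SK_MAXDSIZ)
        return 0;
    return 1;
}

/* Page-granular text, data, and bss sizes, as computed before mapping. */
static struct sk_sizes sk_layout(const struct sk_exec *h)
{
    struct sk_sizes s;

    s.tsize = sk_roundup(h->a_text, SK_NBPG);
    s.dsize = sk_roundup(h->a_data, SK_NBPG);
    /* bss covers whatever is left after data fills its last page */
    s.bsize = sk_roundup(h->a_bss + s.dsize, SK_NBPG) - s.dsize;
    return s;
}
```

Calling `sk_header_ok()` on a header and then `sk_layout()` on the same header gives the three sizes that the listing hands to vm_mmap() and vm_allocate(). Because the size checks run in the same order as in the listing, the sum of a_data and a_text is only computed once both are known to be within their limits.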


[LISTING TWO] 


/* Copyright (c) 1992 William Jolitz. All rights reserved. 

* Written by William Jolitz 1/92 

* Redistribution and use in source and binary forms, with or without 

* modification, are permitted provided that the following conditions 

* are met: 

* 1. Redistributions of source code must retain the above copyright 

* notice, this list of conditions and the following disclaimer. 

* 2. Redistributions in binary form must reproduce the above copyright 

* notice, this list of conditions and the following disclaimer in the 

* documentation and/or other materials provided with the distribution. 

* 3. All advertising materials mentioning features or use of this software 

* must display the following acknowledgement: 

* This software is a component of "386BSD" developed by 
* William F. Jolitz, TeleMuse. 

* 4. Neither the name of the developer nor the name "386BSD" 

* may be used to endorse or promote products derived from this software 

* without specific prior written permission. 

* THIS SOFTWARE IS A COMPONENT OF 386BSD DEVELOPED BY WILLIAM F. JOLITZ 

* AND IS INTENDED FOR RESEARCH AND EDUCATIONAL PURPOSES ONLY. THIS SOFTWARE 

* SHOULD NOT BE CONSIDERED TO BE A COMMERCIAL PRODUCT. THE DEVELOPER URGES 

* THAT USERS WHO REQUIRE A COMMERCIAL PRODUCT NOT MAKE USE OF THIS WORK. THIS 

* SOFTWARE IS PROVIDED BY THE DEVELOPER "AS IS'' AND ANY EXPRESS OR IMPLIED 

* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 

* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 

* EVENT SHALL THE DEVELOPER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 

* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 

* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; 

* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 

* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR 

* OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 

* ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 

* 


* Block I/O Cache mechanism, ala malloc().
*/

#include "param.h"
#include "proc.h"
#include "vnode.h"
#include "buf.h"
#include "specdev.h"
#include "mount.h"
#include "malloc.h"
#include "resourcevar.h"

/* Initialize buffer headers and related structures. */ 
void bufinit() 

{ 

struct bufhd *bh; 
struct buf *bp; 

/* first, make a null hash table */ 
for(bh = bufhash; bh < bufhash + BUFHSZ; bh++) { 

bh->b_flags = 0; 
bh->b_forw = (struct buf *)bh; 
bh->b_back = (struct buf *)bh; 

} 

/* next, make a null set of free lists */ 
for(bp = bfreelist; bp < bfreelist + BQUEUES; bp++) { 

bp->b_flags = 0; 
bp->av_forw = bp; 
bp->av_back = bp; 
bp->b_forw = bp; 
bp->b_back = bp; 

} 

/* finally, initialize each buffer header and stick on empty q */ 
for(bp = buf; bp < buf + nbuf ; bp++) { 

bp->b_flags = B_HEAD | B_INVAL; /* we're just an empty header */ 
bp->b_dev = NODEV; 
bp->b_vp = 0; 

binstailfree(bp, bfreelist + BQ_EMPTY); 
binshash(bp, bfreelist + BQ_EMPTY); 

} 

} 

/* Find the block in the buffer pool. If buffer is not present, allocate a new 
* buffer and load its contents according to the filesystem fill routine. */ 
bread(vp, blkno, size, cred, bpp) 
struct vnode *vp; 
daddr_t blkno; 
int size; 

struct ucred *cred; 
struct buf **bpp; 

{ 

struct buf *bp; 
int rv = 0; 

bp = getblk (vp, blkno, size); 

/* if not found in cache, do some I/O */ 

if ((bp->b_flags & B_CACHE) == 0 || (bp->b_flags & B_INVAL) != 0) { 

bp->b_flags |= B_READ; 

bp->b_flags &= ~(B_DONE|B_ERROR|B_INVAL); 

VOP_STRATEGY(bp); 
rv = biowait (bp) ; 

} 

*bpp = bp; 
return (rv); 

} 

/* Operates like bread, but also starts I/O on the specified read-ahead block. 





* [See page 55 of Bach's Book] */ 

breada(vp, blkno, size, rablkno, rabsize, cred, bpp) 
struct vnode *vp; 
daddr_t blkno; int size; 
daddr_t rablkno; int rabsize; 
struct ucred *cred; 
struct buf **bpp; 

{ 

struct buf *bp, *rabp; 

int rv = 0, needwait = 0; 

bp = getblk (vp, blkno, size); 

/* if not found in cache, do some I/O */ 

if ((bp->b_flags & B_CACHE) == 0 || (bp->b_flags & B_INVAL) != 0) { 

bp->b_flags |= B_READ; 

bp->b_flags &= ~(B_DONE|B_ERROR|B_INVAL); 

VOP_STRATEGY(bp); 
needwait++; 

} 

rabp = getblk (vp, rablkno, rabsize); 

/* if not found in cache, do some I/O (overlapped with first) */ 
if ((rabp->b_flags & B_CACHE) == 0 || (rabp->b_flags & B_INVAL) != 0) { 
rabp->b_flags |= B_READ | B_ASYNC; 
rabp->b_flags &= ~(B_DONE|B_ERROR|B_INVAL); 

VOP_STRATEGY(rabp); 

} else 

brelse(rabp); 

/* wait for original I/O */ 
if (needwait) 

rv = biowait (bp) ; 

*bpp = bp; 
return (rv); 

} 

/* Synchronous write. Release buffer on completion. */ 
bwrite(bp) 

register struct buf *bp; 

{ 

int rv; 

if(bp->b_flags & B_INVAL) { 
brelse(bp); 
return (0); 

} else { 

int wasdelayed; 

wasdelayed = bp->b_flags & B_DELWRI; 

bp->b_flags &= ~(B_READ|B_DONE|B_ERROR|B_ASYNC|B_DELWRI); 
if(wasdelayed) reassignbuf(bp, bp->b_vp); 
bp->b_flags |= B_DIRTY; 

VOP_STRATEGY(bp); 
rv = biowait(bp); 
if (!rv) 

bp->b_flags &= ~B_DIRTY; 
brelse(bp); 
return (rv); 

} 

} 

/* Delayed write. The buffer is marked dirty, but is not queued for I/O. This 

* routine should be used when the buffer is expected to be modified again 

* soon, typically a small write that partially fills a buffer. NB: magnetic 





* tapes can't be delayed; must be written in order writes are requested. */ 
void bdwrite(bp) 

register struct buf *bp; 

{ 

if(bp->b_flags & B_INVAL) 
brelse(bp); 

if(bp->b_flags & B_TAPE) { 
bwrite(bp); 
return; 

} 

bp->b_flags &= ~(B_READ|B_DONE); 
bp->b_flags |= B_DIRTY|B_DELWRI; 
reassignbuf(bp, bp->b_vp); 
brelse(bp); 
return; 

} 

/* Asynchronous write. Start I/O on a buffer, but do not wait for it to 

* complete. The buffer is released when the I/O completes. */ 
bawrite(bp) 

register struct buf *bp; 

{ 

if(!(bp->b_flags & B_BUSY))panic("bawrite: not busy"); 
if(bp->b_flags & B_INVAL) 
brelse(bp); 
else { 

int wasdelayed; 

wasdelayed = bp->b_flags & B_DELWRI; 
bp->b_flags &= ~(B_READ|B_DONE|B_ERROR|B_DELWRI); 
if(wasdelayed) reassignbuf(bp, bp->b_vp); 
bp->b_flags |= B_DIRTY | B_ASYNC; 

VOP_STRATEGY(bp); 

} 

} 

/* Release a buffer. Even if the buffer is dirty, no I/O is started. */ 
brelse(bp) 

register struct buf *bp; 

{ 

int x; 

/* anyone need a "free" block? */ 
x=splbio(); 

if ( (bfreelist + BQ_AGE)->b_flags & B_WANTED) { 

(bfreelist + BQ_AGE) ->b_flags &= ~B_WANTED; 
wakeup(bfreelist) ; 

} 

/* anyone need this very block? */ 
if (bp->b_flags & B_WANTED) { 
bp->b_flags &= ~B_WANTED; 
wakeup(bp); 

} 

if (bp->b_flags & (B_INVAL|B_ERROR)) { 

bp->b_flags |= B_INVAL; 
bp->b_flags &= ~(B_DELWRI|B_CACHE); 
if(bp->b_vp) 

brelvp(bp); 

} 

/* enqueue */ 

/* buffers with junk contents */ 





if(bp->b_flags & (B_ERROR|B_INVAL|B_NOCACHE)) 
binsheadfree(bp, bfreelist + BQ_AGE) 

/* buffers with stale but valid contents */ 
else if(bp->b_flags & B_AGE) 

binstailfree (bp, bfreelist + BQ_AGE) 

/* buffers with valid and quite potentially reuseable contents */ 
else 

binstailfree(bp, bfreelist + BQ_LRU) 

/* unlock */ 
bp->b_flags &= ~B_BUSY; 
splx(x); 
return; 

} 

int freebufspace = 20*NBPG; 
int allocbufspace; 

/* Find a buffer which is available for use. If free memory for buffer space 

* and an empty header from the empty list, use that. Otherwise, select 

* something from a free list. Preference is to AGE list, then LRU list. */ 
struct buf * 

getnewbuf(sz) 

{ 

struct buf *bp; 
int x; 

x = splbio(); 
start: 

/* can we constitute a new buffer? */ 
if (freebufspace > sz 

&& bfreelist[BQ_EMPTY].av_forw != (struct buf *)bfreelist+BQ_EMPTY) { 
caddr_t addr; 

if ((addr = malloc (sz, M_TEMP, M_NOWAIT)) == 0) goto tryfree; 

freebufspace -= sz; 

allocbufspace += sz; 

bp = bfreelist[BQ_EMPTY].av_forw; 

bp->b_flags = B_BUSY | B_INVAL; 

bremfree(bp); 

bp->b_un.b_addr = (caddr_t) addr; 
goto fillin; 

} 

tryfree: 

if (bfreelist[BQ_AGE].av_forw != (struct buf *)bfreelist+BQ_AGE) { 
bp = bfreelist[BQ_AGE].av_forw; 
bremfree(bp); 

} else if (bfreelist[BQ_LRU].av_forw != (struct buf *)bfreelist+BQ_LRU) { 
bp = bfreelist[BQ_LRU].av_forw; 
bremfree(bp); 

} else { 

/* wait for a free buffer of any kind */ 

(bfreelist + BQ_AGE)->b_flags |= B_WANTED; 
sleep(bfreelist, PRIBIO); 
splx(x); 
return (0); 

} 

/* if we are a delayed write, convert to an async write! */ 
if (bp->b_flags & B_DELWRI) { 
bp->b_flags |= B_BUSY; 
bawrite (bp); 
goto start; 





} 

if(bp->b_vp) 

brelvp(bp); 

/* we are not free, nor do we contain interesting data */ 
bp->b_flags = B_BUSY; 
fillin: 

bremhash(bp); 
splx(x); 

bp->b_dev = NODEV; 

bp->b_vp = NULL; 

bp->b_blkno = bp->b_lblkno = 0; 

bp->b_iodone = 0; 

bp->b_error = 0; 

bp->b_wcred = bp->b_rcred = NOCRED; 
if (bp->b_bufsize != sz) allocbuf(bp, sz); 
bp->b_bcount = bp->b_bufsize = sz; 
bp->b_dirtyoff = bp->b_dirtyend = 0; 
return (bp); 

} 

/* Check to see if a block is currently memory resident. */ 
struct buf *incore(vp, blkno) 
struct vnode *vp; 
daddr_t blkno; 

{ 

struct buf *bh; 
struct buf *bp; 
bh = BUFHASH(vp, blkno); 

/* Search hash chain */ 
bp = bh->b_forw; 

while (bp != (struct buf *) bh) { 

/* hit */ 

if (bp->b_lblkno == blkno && bp->b_vp == vp 
&& (bp->b_flags & B_INVAL) == 0) 
return (bp); 
bp = bp->b_forw; 

} 

return(0); 

} 

/* Get a block of requested size that is associated with a given vnode and 

* block offset. If it is found in block cache, mark it as found, make it busy 

* and return it. Otherwise, return empty block of the correct size. It is up 

* to caller to insure that the cached blocks be of the correct size. */ 
struct buf * 

getblk(vp, blkno, size) 

register struct vnode *vp; 
daddr_t blkno; 
int size; 

{ 

struct buf *bp, *bh; 
int x; 
for (;;) { 

if (bp = incore(vp, blkno)) { 
x = splbio (); 

if (bp->b_flags & B_BUSY) { 
bp->b_flags |= B_WANTED; 
sleep (bp, PRIBIO); 
continue; 





} 

bp->b_flags |= B_BUSY | B_CACHE; 

bremfree(bp); 

if (size > bp->b_bufsize) 

panic("now what do we do?"); 

} else { 

if((bp = getnewbuf(size)) == 0) continue; 

bp->b_blkno = bp->b_lblkno = blkno; 

bgetvp(vp, bp); 

x = splbio(); 

bh = BUFHASH(vp, blkno); 

binshash(bp, bh); 

bp->b_flags = B_BUSY; 

} 

splx(x); 
return (bp); 

} 

} 

/* Get an empty, disassociated buffer of given size. */ 
struct buf * 
geteblk(size) 
int size; 

{ 

struct buf *bp; 
int x; 

while ((bp = getnewbuf(size)) == 0) 
; 

x = splbio(); 

binshash(bp, bfreelist + BQ_AGE); 
splx(x); 
return (bp); 

} 

/* Exchange a buffer's underlying buffer storage for one of different size, 

* taking care to maintain contents appropriately. When buffer increases in 

* size, caller is responsible for filling out additional contents. When buffer 

* shrinks in size, data is lost, so caller must first return it to backing 

* store before shrinking the buffer, as no implied I/O will be done. 

* Expanded buffer is returned as value. */ 
struct buf * 

allocbuf(bp, size) 

register struct buf *bp; 
int size; 

{ 

caddr_t newcontents; 

/* get new memory buffer */ 

newcontents = (caddr_t) malloc (size, M_TEMP, M_WAITOK); 

/* copy the old into the new, up to the maximum that will fit */ 
bcopy (bp->b_un.b_addr, newcontents, min(bp->b_bufsize, size)); 

/* return old contents to free heap */ 
free (bp->b_un.b_addr, M_TEMP); 

/* adjust buffer cache's idea of memory allocated to buffer contents */ 
freebufspace -= size - bp->b_bufsize; 
allocbufspace += size - bp->b_bufsize; 

/* update buffer header */ 
bp->b_un.b_addr = newcontents; 
bp->b_bcount = bp->b_bufsize = size; 
return(bp); 





} 

/* Patiently await operations to complete on this buffer. When they do, 

* extract error value and return it. Extract and return any errors associated 

* with the I/O. If an invalid block, force it off the lookup hash chains. */ 
biowait(bp) 

register struct buf *bp; 

{ 

int x; 

x = splbio(); 

while ((bp->b_flags & B_DONE) == 0) 
sleep((caddr_t)bp, PRIBIO) ; 
if((bp->b_flags & B_ERROR) || bp->b_error) { 
if ((bp->b_flags & B_INVAL) == 0) { 

bp->b_flags |= B_INVAL; 
bremhash(bp) ; 

binshash(bp, bfreelist + BQ_AGE); 

} 

if (!bp->b_error) 

bp->b_error = EIO; 

else 

bp->b_flags |= B_ERROR; 
splx(x); 

return (bp->b_error); 

} else { 

splx(x); 
return (0); 

} 

} 

/* Finish up operations on a buffer, calling an optional function (if 

* requested), and releasing the buffer if marked asynchronous. Then mark this 

* buffer done so that others biowait()'ing for it will notice when they are 

* woken up from sleep(). */ 
biodone(bp) 

register struct buf *bp; 

{ 

int x; 

x = splbio(); 

if (bp->b_flags & B_CALL) (*bp->b_iodone)(bp); 

bp->b_flags &= ~B_CALL; 

if (bp->b_flags & B_ASYNC) brelse (bp); 

bp->b_flags &= ~B_ASYNC; 

bp->b_flags |= B_DONE; 

wakeup(bp); 

splx(x); 

} 
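
The hash-chain handling in Listing Two rests on one idea: each chain is a circular, doubly linked list whose header acts as a sentinel, so an empty chain is a header that points at itself and insertion and removal need no special cases. The following is only a user-space sketch of that idea under simplifying assumptions. The `sk_` names are hypothetical, the vnode is an opaque pointer, and a plain integer field stands in for the B_INVAL flag; it is not the kernel's binshash()/bremhash() code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified buffer header: only what a chain lookup needs. */
struct sk_buf {
    struct sk_buf *b_forw, *b_back;  /* hash chain links */
    const void    *b_vp;             /* owning "vnode" (opaque here) */
    long           b_lblkno;         /* logical block number */
    int            b_invalid;        /* stands in for B_INVAL */
};

/* An empty chain is a header that points at itself. */
static void sk_chain_init(struct sk_buf *head)
{
    head->b_forw = head->b_back = head;
}

/* Insert bp at the front of the chain. */
static void sk_chain_insert(struct sk_buf *head, struct sk_buf *bp)
{
    bp->b_forw = head->b_forw;
    bp->b_back = head;
    head->b_forw->b_back = bp;
    head->b_forw = bp;
}

/* Unlink bp from whatever chain it is on. */
static void sk_chain_remove(struct sk_buf *bp)
{
    bp->b_back->b_forw = bp->b_forw;
    bp->b_forw->b_back = bp->b_back;
    bp->b_forw = bp->b_back = bp;
}

/* Same loop shape as incore(): walk until we are back at the header. */
static struct sk_buf *sk_incore(struct sk_buf *head, const void *vp,
                                long blkno)
{
    struct sk_buf *bp;

    for (bp = head->b_forw; bp != head; bp = bp->b_forw)
        if (bp->b_lblkno == blkno && bp->b_vp == vp && !bp->b_invalid)
            return bp;
    return NULL;
}
```

A caller initializes one header per hash bucket with `sk_chain_init()`, inserts buffers as they are assigned a block, and looks them up with `sk_incore()`. An invalid buffer stays on its chain but is skipped by the lookup, which matches the test made in the listing's incore().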

[LISTING THREE] 

/* Copyright (c) 1992 William F. Jolitz. All rights reserved. 

* Written by William Jolitz 1/92 

* Redistribution and use in source and binary forms, with or without 

* modification, are permitted provided that the following conditions 

* are met: 1. Redistributions of source code must retain the above copyright 

* notice, this list of conditions and the following disclaimer. 

* 2. Redistributions in binary form must reproduce the above copyright 

* notice, this list of conditions and the following disclaimer in the 

* documentation and/or other materials provided with the distribution. 





* 3. All advertising materials mentioning features or use of this software 

* must display the following acknowledgement: 

* This software is a component of "386BSD" developed by 
* William F. Jolitz, TeleMuse. 

* 4. Neither the name of the developer nor the name "386BSD" 

* may be used to endorse or promote products derived from this software 

* without specific prior written permission. 

* THIS SOFTWARE IS A COMPONENT OF 386BSD DEVELOPED BY WILLIAM F. JOLITZ 

* AND IS INTENDED FOR RESEARCH AND EDUCATIONAL PURPOSES ONLY. THIS SOFTWARE 

* SHOULD NOT BE CONSIDERED TO BE A COMMERCIAL PRODUCT. THE DEVELOPER URGES 

* THAT USERS WHO REQUIRE A COMMERCIAL PRODUCT NOT MAKE USE OF THIS WORK. THIS 

* SOFTWARE IS PROVIDED BY THE DEVELOPER "AS IS'' AND ANY EXPRESS OR IMPLIED 

* WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF 

* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO 

* EVENT SHALL THE DEVELOPER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, 

* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 

* PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; 

* OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 

* WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR 

* OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF 

* ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 

* 

* Ring Buffer code for 386BSD. */ 

#include "param.h" 

#include "systm.h" 

#include "buf.h" 

#include "ioctl.h" 

#include "tty.h" 

putc(c, rbp) struct ringb *rbp; 

{ 

char *nxtp; 

/* ring buffer full? */ 

if ( (nxtp = RB_SUCC(rbp, rbp->rb_tl)) == rbp->rb_hd) return (-1); 

/* stuff character */ 

*rbp->rb_tl = c; 
rbp->rb_tl = nxtp; 
return(0); 

} 

getc(rbp) struct ringb *rbp; 

{ 

u_char c; 

/* ring buffer empty? */ 

if (rbp->rb_hd == rbp->rb_tl) return(-1); 

/* fetch character, locate next character */ 
c = * (u_char *) rbp->rb_hd; 
rbp->rb_hd = RB_SUCC(rbp, rbp->rb_hd); 
return (c); 

} 

nextc(cpp, rbp) struct ringb *rbp; char **cpp; { 
if (*cpp == rbp->rb_tl) return (0); 
else { char *cp; 
cp = *cpp; 

*cpp = RB_SUCC(rbp, cp); 
return(*cp); 

} 

} 





ungetc(c, rbp) struct ringb *rbp; 

{ 

char *backp; 

/* ring buffer full? */ 

if ( (backp = RB_PRED(rbp, rbp->rb_hd)) == rbp->rb_tl) return (-1); 
rbp->rb_hd = backp; 

/* stuff character */ 

*rbp->rb_hd = c; 
return(0) ; 

} 

unputc(rbp) struct ringb *rbp; 

{ 

char *backp; 

int c; 

/* ring buffer empty? */ 

if (rbp->rb_hd == rbp->rb_tl) return(-1); 

/* backup buffer and dig out previous character */ 
backp = RB_PRED(rbp, rbp->rb_tl); 
c = *(u_char *)backp; 
rbp->rb_tl = backp; 
return(c); 

} 

#define peekc(rbp) (*(rbp)->rb_hd) 
initrb(rbp) struct ringb *rbp; { 

rbp->rb_hd = rbp->rb_tl = rbp->rb_buf; 

} 

/* Example code for contiguous operations: 

nc = RB_CONTIGPUT(&rb); 
if (nc) { 

if (nc >9) nc = 9; 

bcopy("ABCDEFGHI", rb.rb_tl, nc); 
rb.rb_tl += nc; 

rb.rb_tl = RB_ROLLOVER(&rb, rb.rb_tl); 

} 


nc = RB_CONTIGGET(&rb); 
if (nc) { 

if (nc > 79) nc = 79; 

bcopy(rb.rb_hd, stringbuf, nc); 

rb.rb_hd += nc; 

rb.rb_hd = RB_ROLLOVER(&rb, rb.rb_hd); 
stringbuf[nc] = 0; 
printf ( "%s | " , stringbuf); 

} 

*/ 

/* Concatenate ring buffers. */ 
catb(from, to) 

struct ringb *from, *to; 

{ 

int c; /* int, not char: getc() returns -1 when empty, else 0..255 */ 

while ((c = getc(from)) >= 0) 
putc(c, to); 

} 





[LISTING FOUR] 


/* [Excerpted from tty.h, 386BSD Release 0.0 - wfj] */ 

/* Ring buffers provide a contiguous, dense storage for character data used 
* by the tty driver. */ 

#define RBSZ 1024 
struct ringb { 

char *rb_hd; /* head of buffer segment to be read */ 

char *rb_tl; /* tail of buffer segment to be written */ 

char rb_buf[RBSZ]; /* segment contents */ 

}; 

#define RB_SUCC(rbp, p) \ 

((p) >= (rbp)->rb_buf + RBSZ - 1 ? (rbp)->rb_buf : (p) + 1) 

#define RB_ROLLOVER(rbp, p) \ 

((p) > (rbp)->rb_buf + RBSZ - 1 ? (rbp)->rb_buf : (p)) 

#define RB_PRED(rbp, p) \ 

((p) <= (rbp)->rb_buf ? (rbp)->rb_buf + RBSZ - 1 : (p) - 1) 

#define RB_LEN(rp) \ 

((rp)->rb_hd <= (rp)->rb_tl ? (rp)->rb_tl - (rp)->rb_hd : \ 

RBSZ - ((rp)->rb_hd - (rp)->rb_tl)) 

#define RB_CONTIGPUT(rp) \ 

(RB_PRED(rp, (rp)->rb_hd) < (rp)->rb_tl ? \ 

(rp)->rb_buf + RBSZ - (rp)->rb_tl : \ 

RB_PRED(rp, (rp)->rb_hd) - (rp)->rb_tl) 

#define RB_CONTIGGET(rp) \ 

((rp)->rb_hd <= (rp)->rb_tl ? (rp)->rb_tl - (rp)->rb_hd : \ 

(rp)->rb_buf + RBSZ - (rp)->rb_hd) 
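
One property of the ring buffer in Listings Three and Four is easy to miss: because "empty" is encoded as head equal to tail, one slot is always left unused, so a buffer of RBSZ bytes holds at most RBSZ - 1 characters. The sketch below reproduces the put/get logic in user space to show this. It is an illustration under assumptions, not the tty code: the `sk_`/`SK_` names are hypothetical (renamed to avoid clashing with stdio's putc and getc), and the buffer is shrunk to 8 bytes so the wraparound is easy to follow.

```c
#include <assert.h>

#define SK_RBSZ 8   /* deliberately tiny; the listing uses 1024 */

struct sk_ringb {
    char *rb_hd;             /* next character to read */
    char *rb_tl;             /* next free slot to write */
    char  rb_buf[SK_RBSZ];   /* storage */
};

/* Successor of p, wrapping at the end of the buffer (as RB_SUCC). */
#define SK_RB_SUCC(rbp, p) \
    ((p) >= (rbp)->rb_buf + SK_RBSZ - 1 ? (rbp)->rb_buf : (p) + 1)

static void sk_rb_init(struct sk_ringb *rbp)
{
    rbp->rb_hd = rbp->rb_tl = rbp->rb_buf;
}

/* Append one character; -1 if the ring is full. */
static int sk_rb_put(struct sk_ringb *rbp, int c)
{
    char *nxtp = SK_RB_SUCC(rbp, rbp->rb_tl);

    if (nxtp == rbp->rb_hd)
        return -1;           /* advancing tail would look like "empty" */
    *rbp->rb_tl = (char)c;
    rbp->rb_tl = nxtp;
    return 0;
}

/* Remove one character; -1 if the ring is empty, else 0..255. */
static int sk_rb_get(struct sk_ringb *rbp)
{
    int c;

    if (rbp->rb_hd == rbp->rb_tl)
        return -1;
    c = *(unsigned char *)rbp->rb_hd;   /* unsigned, so never negative */
    rbp->rb_hd = SK_RB_SUCC(rbp, rbp->rb_hd);
    return c;
}

/* Number of characters currently stored (as RB_LEN). */
static int sk_rb_len(const struct sk_ringb *rbp)
{
    return rbp->rb_hd <= rbp->rb_tl
        ? (int)(rbp->rb_tl - rbp->rb_hd)
        : SK_RBSZ - (int)(rbp->rb_hd - rbp->rb_tl);
}
```

Filling a freshly initialized ring accepts seven characters and rejects the eighth. The unsigned fetch in `sk_rb_get()` also matters to callers: a character with the high bit set comes back as a value in 128..255, so the result should be held in an int if it is to be compared against the -1 "empty" return.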

Porting Unix To The 386: The Final Step 

Overview of the impact of Release 0.0 on the BSD community. Installation procedures for and highlights 
of Release 0.1. Final installment of series. 

Running light with 386BSD 

William Frederick Jolitz and Lynne Greer Jolitz 

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National 
Semiconductor’s GENIX project, the first virtual-memory, microprocessor-based UNIX system. Prior to 
establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric 
Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or 
comments to ljolitz@cardio.ucsf.edu. (c) 1992 TeleMuse. 

Over the past couple of months, we’ve discussed the minimal code and methodology required to fill in the 
missing pieces of the incomplete Net/2 tape, leading to an operational 386BSD kernel. Still, the step from 
a handful of kernel changes to a real system is a large one, as is evident from the items that had to be 
provided along with this kernel: bootstraps, file systems, an installation mechanism, binaries of utilities, 
and documentation. 

However, we decided it was time to step away from the kernel and make all of 386BSD available and 
accessible, so that it could become the generic research and educational platform we envisioned when we 
wrote "386BSD: A Modest Proposal," the software specification for 386BSD, in mid-1989. So, on March 
17, 1992, we launched 386BSD Release 0.0. 


386BSD Release 0.0--Liftoff 

386BSD Release 0.0 consisted of: 

• One distribution installation floppy. 

• One 8-floppy multivolume compressed TAR-format source distribution (31 Mbytes uncompressed). 

• One 6-floppy multivolume compressed TAR-format binary distribution (28 Mbytes uncompressed). 

• One 360K MS-DOS difference floppy (for those who want to do all the work themselves). 

• Release notes regarding installation procedures, manifests, and registration/bug report forms. 

All of 386BSD Release 0.0 was released under freely redistributable and modifiable terms (with attribution 
to the authors maintained, a small and reasonable request for so much). 

With the assistance of several dedicated network volunteers (among them John Sokel, Dan Kionka, and 
members of the Silicon Valley Computer Society), 386BSD Release 0.0 was made widely available via 
the Internet. Within one month, an estimated 100,000 sites had obtained 386BSD Release 0.0. (Several 
networks, particularly in Australia, melted down over the transmission traffic and had to be regulated.) 

The enthusiastic response from Internet, BBSs, and various user groups (through which copies were 
widely distributed) has far exceeded our expectations. 

Another pleasant surprise was the number of software contributions, bug fixes, and suggestions from early 
users. People were eager not only to supply their code and knowledge, but to help others get their systems 
running too. 

Finally, it was gratifying to learn that our little system, bugs and all, was still capable of being complete 
enough to be used for the rest of its own development. There’s little question that, in less than a month, 
386BSD Release 0.0 was an unqualified success. 

386BSD Release 0.1--The Second Stage 

386BSD Release 0.1 is the most recent release as of this writing. It consists of: 

• A single distribution installation floppy, referred to as "Tiny 386BSD." 

• One 15-floppy multivolume compressed TAR-format source distribution. 

• One 10-floppy multivolume compressed TAR-format binary distribution. 

• Installation notes, manifests, and registration/bug report forms. 

The major Internet sites from which you can download 386BSD 0.1 are agate.berkeley.edu and 
reyes.stanford.edu. Additionally, we’re available to answer questions and provide some support on 
CompuServe (CIS#76703,4266) and in the UNIX/BSD conference on Bix. 

For limited program development, a 386/486 system should contain at least two Mbytes of RAM and a 
40-Mbyte hard disk. To make full use of the source tree and generate new software distributions, a 
200-Mbyte disk is recommended. (You can also use 386BSD’s version of NFS to obtain space via the 
Ethernet off a central shared server.) Performance obviously improves with faster processors and more 
memory. 





Just in case anyone is wondering, the entire 386BSD system was created on a 386SX laptop with three 
Mbytes of RAM and one 100-Mbyte disk. 

What’s in 386BSD Release 0.1 

Thanks to many knowledgeable users, 386BSD 0.1 is a more robust version of 386BSD, supporting 
broader combinations of PC hardware and simpler installation procedures. It also provides more 
features and functionality. 

386BSD 0.1 also contains many utilities which can be used in development work (see Table 1), including 
a C compiler, C++ compiler, loader, network protocol family (TCP/IP), and so forth. 386BSD also 
contains a complete set of Internet networking facilities (including NFS), a program development 
environment for gigabyte-sized programs, document-preparation and text-editing tools, and database 
mechanisms. And finally, it can rebuild itself from its own source tree. 

Table 1: 386BSD Release 0.1 utilities. 


apropos     env         lpr         pwd         timed
ar          eqn         lprm        query       timedc
arp         expand      lptest      ranlib      tip
as          expr        ls          rcp         tn3270
bad144      false       m4          rdist       touch
badsect     find        machine     rdump       tput
basename    finger      mail        reboot      tr
biff        fmt         mailstats   renice      trace
cal         fold        make        restore     traceroute
calendar    from        man         rlogin      troff
cat         fsck        mesg        rm          true
cc          fstat       mkdep       rmail       tsort
checknr     ftp         mkdir       rmdir       tty
chgrp       g++         mkfifo      rmt         tunefs
chmod       gcc         mknod       route       ul
chown       gdb         mkstr       routed      umount
chpass      genclass    more        rrestore    uncompress
chroot      grep        mount       rsh         unexpand
cksum       groff       mountd      savecore    unifdef
clear       grops       mset        script      uniq
clri        grotty      mtree       sed         unvis
cmp         groups      mv          sh          update
col         halt        named       shar        uptime
colcrt      head        newfs       showmount   users
colrm       hexdump     nfsd        shutdown    uudecode
column      hostname    nfsiod      size        uuencode
comm        id          nfsstat     slattach    vacation
compress    ifconfig    nice        sleep       vipw
config      inetd       nld         sliplogin   vis
cp          init        nm          soelim      w
cpio        install     nohup       split       wall
cpp         kdump       nroff       strings     wc
csh         kill        nslookup    strip       what
ctags       ktrace      nsquery     stty        whatis
cu          last        nstest      su          whereis
cut         ld          old         swapon      which
date        leave       pac         symorder    who
dd          lex         pagesize    sync        whoami
df          ln          passwd      syslogd     whois
dirname     locate      paste       talk        write
disklabel   lock        pic         tar         xargs
diskpart    logger      ping        tbl         xstr
du          login       portmap     tee         yacc
dump        logname     printenv    telnet      yes
dumpfs      lpc         printf      test        yyfix
echo        lpd         ps          tftp        zcat
elvis       lpq         psroff      time



Qualifying a PC to Run 386BSD 

Release 0.1 can be run from as little as a single floppy disk, using the Tiny 386BSD diskette available 
through the DDJ Careware Project. Send us a formatted, error-free, high-density 3.5- or 5.25-inch floppy 
diskette and an addressed, stamped diskette mailer, in care of: Tiny 386BSD, Dr. Dobb's Journal, 411 
Borel Ave., San Mateo, CA 94402, and we’ll send you the latest copy. There’s no charge, but if you want 
to slip in a dollar or so to help out the kids at the Children’s Support League of the East Bay, we know 
they’d appreciate it. (You can also obtain this software directly from the sites mentioned above.) In 
addition to letting you experiment with a very minimal 386BSD system before loading any software on the hard 
disk, Tiny 386BSD allows you to validate 386BSD operation on a PC. 

Simply insert the floppy into the drive and boot up the PC. If it boots and prompts you for a shell 
command (#), you’re ready for installation of the rest of the system. If it fails, it’s time to skull out the PC 
configuration, jumpers, BIOS setup menu, and all the other little "bright spots" that make for interesting 
compatibility problems. The general rule here is to isolate the problem by comparing cases that work with 
those that don’t until an explanation can be formed. 

In general, nifty hardware features are the source of most compatibility problems, along with bizarre 
hardware combinations that are "on the edge." Both should be avoided or defeated when they are behaving 
suspiciously. Mainstream hardware from patient and understanding firms helps a great deal. 

Another problem arises with non-standard "old" equipment that you just happen to have lying around. Do 
yourself a favor and leave these particular PCs to MS-DOS, since other hidden surprises possibly await. In 
short, avoid the "pathological" cases wherever possible. 

Installing 386BSD Release 0.1 

To install the rest of 386BSD, we must allocate a large portion (greater than or equal to 40 Mbytes) of 
formatted disk space on a hard disk—either the entire contents of a disk drive or the remaining contents of 
a disk drive after other systems have been partitioned. The process discussed here is an exact duplicate of 
the mechanism we devised five years ago for system-installation procedures on Symmetric Computer 
Systems’ 375 computers. (This was contributed to Berkeley, and parts appeared in the 4.3BSD Tahoe 
release.) Since 386BSD 0.1 is experimental software, we recommend that it be run with a dedicated disk 
on a dedicated system, until it matures. 

Disk space must first be low-level formatted by a utility. IDE and SCSI drives come preformatted. ESDI controllers 
have formatting programs in ROM that can be run from the MS-DOS Debug utility, and the geometry of the 
drive can be obtained from these programs (in terms of cylinders, tracks or heads, and sectors). ESDI 
drives use the last cylinder to hold bad block tables; currently these are not used, and the last cylinder must 
be ignored. In addition, since 386BSD uses its own bad block revectoring (a la DEC standard 144), sector 
sparing should not be used. 

If possible, the formatted drive should have the same low-level geometry as the hard drive itself. However, 
drives with more than 1024 cylinders are run in a logical translation mode by the disk controller to make up 
for a limitation in MS-DOS. While this logical formatting will work, the 386BSD file system will not be 
as efficient, since its clever rotational placement algorithms won't mesh with what the physical drive is 
actually doing. 

With the drive geometry information in hand, a disktab entry describing how 386BSD is to use the disk 
space must be written; Example 1 is a sample disktab entry. In general, we prefer a 5- to 10-Mbyte root 
partition, a swap partition about twice as big as the amount of RAM, and a /usr partition that 
contains the remainder. Each partition has a size, offset, and type associated with it. Both size and offset are 
in units of sectors, and each partition is arranged to start at the beginning of a cylinder, so that the 
rotational placement algorithm won't be thrown off by a logical partition offset. We then use the disklabel 
command from the floppy-based system (analogous to fdisk) to install a machine-readable version of the 
disktab entry onto the hard disk, along with a bootstrap program. 

Example 1: A sample disktab entry. 

cp3100|Connor Peripherals 100MB IDE:\
        :dt=ST506:ty=winchester:se#512:nt#8:ns#33:nc#766:sf:\
        :pa#12144:oa#0:ta=4.2BSD:ba#4096:fa#512:\
        :pb#12144:ob#12144:tb=swap:\
        :pc#202224:oc#0:\
        :ph#177936:oh#24288:th=4.2BSD:bh#4096:fh#512:

Once the hard-disk configuration is completed, we must do a "high-level" formatting of the 386BSD 
partitions that will hold files. Analogous to the MS-DOS format program that creates a blank file structure 
on a hard disk, the newfs program is executed off the floppy and initializes the root and user partitions of 
the hard disk. 

Next, the root partition is made accessible as a file system by means of the mount command, and the 
contents are transferred from the floppy-based system to the hard-disk root file system and dismounted, 
making the hard disk bootable. The floppy-based system is shut down gracefully, and the hard-disk 
version booted in its stead. At this point, the floppy drive can be used to load the remainder of the 
system. We mount the empty user file system and proceed to reload the file system with the tar utility to 
extract the system from a multivolume floppy dump. 

For those with access, the boot floppy allows the restore to occur over the network, thus eliminating the 
need for extracting from floppy dumps. 

System Configuration 

All the configuration files are located in the /etc directory of the root file system, including brief notes on 
setting up the system. The elvis text editor can be used to modify them as needed. Further configuration 
and expansion can be accomplished by loading the sources from another multivolume floppy dump, and 
recompiling the system and its utilities. 







386BSD documentation, including the installation procedures for the rest of 386BSD (binary and/or 
source), should be available as part of the online 386BSD release. If it is not online, contact the sysop or 
moderator to have it installed, and send e-mail via CompuServe at 76703,4266. 

Perspective: The Importance of Thinking Small 

In the spirit of "running light without overbyte," 386BSD is a minimalist system. This approach has 
allowed us to easily discuss the important paradigms effectively leveraged in UNIX and other modern 
operating systems. Over the course of this series, our minimalist approach has forced us to cleave to a 
basic understanding of the functional core of the system, and not get bogged down in the minutiae of 
building obscure utilities. It has also provided us with a bit of "editorial" oversight on the contents of our 
system. Occasionally, it is as necessary to discard an item as it is to create one—otherwise, we would be 
"hip deep" in relics held past their prime. In addition, a minimalist system is an excellent educational and 
training platform in the teaching of operating systems, networking, file systems, C++, and software 
management. 

Another virtue of our minimalist approach is that the sizes of the source and the operating binaries were 
greatly reduced. Paring down redundant utilities and source code increases ease of use without loss 
of functionality. 

Where is 386BSD Heading? 

In subsequent releases, we expect 386BSD to grow in both size of distribution and quality of function, 
without losing its minimalist design. Some topics now ripe for exploration are as follows: 

• Many UNIX systems have adjusted only "fitfully" to life on the PC; it’s almost as if they were 
immiscible from the very start. Yet many good features developed in MS-DOS and Windows are 
missing from the UNIX paradigm. It’s a shame that parochial attitudes keep UNIX systems architects 
from leveraging the largest programming environment present in the world today. 

• File I/O and networking transmission rates seem to be the limiting factor in UNIX systems 
performance, especially as the audio/video data demands of multimedia are starting to become 
significant. The problem is not with the hardware, but with the software and overall architectures. As 
such, PCs and workstations provide actual data transfer rates at only a fraction of the ten Mbytes per 
second possible with state-of-the-art hardware technologies. It’s hard to believe that, while we’re on 
the verge of 100-MIPS processors, most PCs will be transferring data at less than an order of 
magnitude faster than the original, wheezing 8-bit PC. 

• Light-weight processes formed in a tiny fraction of a millisecond are a necessary component to 
experimentation with new programming architectures. Multiprocessor versions of these processes 
should allow extensions into the time and space domains of multi-threaded models. 

• The ability to explore the file-system metaphor without the need for kernel programming is an 
interesting challenge. File systems are a popular area of study these days, because they are an ideal 
vehicle for exploring system performance (bandwidth), integrity (file stability and recovery), 
distributed systems (locality and replication), and the central universal abstraction of the applications 
program interface (like that in Plan 9; see DDJ, January 1991). 





Farewell to the Porting Series: Onward to New Topics 


With the completion of 386BSD and its widespread availability, it’s time to bid farewell to our "Porting 
UNIX to the 386" series. After 17 installments (and, believe it or not, quite a few shortcuts), we actually 
finished our port, and we are happy that people can finally use the system about which we have spent so 
long writing. 

With each new version of 386BSD, we hope to see it become more affordable, available, accessible, 
modifiable, and understandable. 386BSD still has quite a way to go towards becoming a mature operating 
system, but already it has traveled the "useful" portion of the distance. 

As such, we intend to explore topics such as networking, which impact not only 386BSD, but modern 
operating systems in general. While we may revisit 386BSD in our discussions, it is important to view it in 
the context of other modern operating systems approaches on the UNIX side (Plan 9, Mach, Minix, and 
the like) and in the broader commercial domain (MS-DOS, Windows NT, OS/2 2.0, and so forth). 

386BSD is really a microcosm of what these "big" operating systems are all about. 

Given support and encouragement, we also intend to continue the educational and research direction upon 
which 386BSD was based, and we will continue to assist other groups who wish to head in this direction 
with us. However, the growth of 386BSD and discussion of new approaches depends on the continued 
goodwill and enthusiasm of its user base. Everyone is welcome to participate in this process. 





